fix(coherence): Debugging to Get Qwen Working In Agentic Coding#90
Open
tbraun96 wants to merge 50 commits into
…ps the response
When the Bug-2 name-run cap (or F11 within-dedup / F5 cross-flush dedup / F44 perm-fail circuit-breaker) forcibly ends a streaming response, `finish_reason` was previously `"tool_calls"`, because tool calls *were* emitted, just truncated mid-loop. Agent clients (opencode and friends) see a normal-looking tool-call completion, dutifully run the tools, send the next request, and the model loops again. Atlas was breaking the loop one round at a time without ever telling the client.
Add a `tool_loop_capped: bool` on `StreamState`, flipped true alongside `stop_string_triggered` at every tool-call loop guard (4 sites in `tool_handlers.rs`). `handle_done` reads it and overrides `fr` to `"length"` (OpenAI's spec slot for "response was forcibly truncated") ahead of the existing `"tool_calls"` / `finish_reason` fall-throughs. This gives every agent client a clean, spec-compliant hook to break its outer retry loop without needing Atlas-specific headers. Also dumped to the `--dump` synthesized-response body for observability.
Verified: `cargo check`, `cargo clippy --tests`, `cargo fmt --check` all clean. Live repro will follow once the image is rebuilt.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
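The override order described above can be sketched as follows. This is a minimal illustration, not the code in `handle_done`: the struct is a trimmed-down stand-in, and `emitted_tool_calls` and `resolve_finish_reason` are hypothetical names introduced only for this example.

```rust
/// Hypothetical, trimmed-down stand-in for the real `StreamState`.
#[derive(Default)]
pub struct StreamState {
    /// Set by any tool-call loop guard when it forcibly ends the response.
    pub tool_loop_capped: bool,
    /// True once at least one tool call was emitted on this stream
    /// (assumed field, for illustration only).
    pub emitted_tool_calls: bool,
}

/// Choose the `finish_reason` for the final chunk. A tripped loop guard
/// takes priority over the usual "tool_calls" / natural fall-throughs.
pub fn resolve_finish_reason(state: &StreamState, natural: &str) -> String {
    if state.tool_loop_capped {
        "length".to_string()
    } else if state.emitted_tool_calls {
        "tool_calls".to_string()
    } else {
        natural.to_string()
    }
}
```

The point of checking the cap first is that a capped response still contains tool calls, so any later check would report it as a normal completion.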
…ppress output
The PR #87 fix changed finish_reason to "length" when a tool-loop guard trips, so agent clients can break their outer retry loop, but only when the scheduler actually finalises and emits Done. Live repro on opencode revealed the deeper bug: setting `stop_string_triggered = true` in chat_stream only suppresses *output*; the scheduler keeps generating tokens until natural EOS or `max_tokens`. On a degenerate-loop response (model not EOS-ing), this manifests as a hang: the stream silently consumes tokens, the channel can fill, the scheduler can block on `blocking_send`, GPU goes 0%, no Done event ever fires, opencode sits forever waiting on the SSE stream.
Add a cooperative cancellation flag plumbed from chat_stream into the scheduler:
    Arc<AtomicBool> cancel_flag
      │
      ├── created in chat_stream/mod.rs
      ├── passed into InferenceRequest::Streaming { cancel_flag, .. }
      ├── stashed on StreamState (cancel_flag); chat_stream flips true on:
      │     • Bug-2 name-run cap trip (handle_complete_tool_call,
      │       handle_tool_call_end)
      │     • F11 within-response dedup
      │     • F44 perm-fail circuit-breaker
      │     • cross-flush tool_arg_dedup trip
      │     • loop-watchdog fire (SimHash + substring repeat)
      └── carried through PrefillInProgress → ActiveSeq on the scheduler
          side; `emit_step::emit_token` reads it at the top of every
          token-emit and sets `a.finished = true` if flipped (equivalent
          to an EOS), so the existing finalize path runs and
          `handle_done` emits the proper `tool_loop_capped` /
          `finish_reason="length"` chunks + `[DONE]`.
Spill-restored ActiveSeq carries `cancel_flag: None`: the original streaming connection is long gone by the time a swapped-out seq resumes from disk. /v1/completions also passes a fresh never-flipped flag so the scheduler's type-check is satisfied; the guard pipeline doesn't run on that legacy path yet.
Verified: `cargo check`, `cargo clippy --tests`, `cargo fmt --check`, `cargo test -p spark-server` (484 passed), `cargo build --release` all clean.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
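A minimal sketch of the cooperative-cancel pattern described in this commit, using only the standard library. The `ActiveSeq` here is a hypothetical three-field stand-in, and the boolean return value of `emit_token` is an assumption made for the example; the real scheduler types carry far more state.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical scheduler-side sequence. `cancel_flag` is `None` for
/// sequences restored from disk, matching the commit's description.
pub struct ActiveSeq {
    pub finished: bool,
    pub output_tokens: Vec<u32>,
    pub cancel_flag: Option<Arc<AtomicBool>>,
}

/// Check the flag before doing any work for the token. Returns `true`
/// if the token was emitted, `false` if the sequence was cancelled.
pub fn emit_token(a: &mut ActiveSeq, tok: u32) -> bool {
    if let Some(flag) = &a.cancel_flag {
        if flag.load(Ordering::Acquire) {
            // Treated like an EOS: the normal finalize path takes over.
            a.finished = true;
            return false;
        }
    }
    a.output_tokens.push(tok);
    true
}
```

The stream side keeps its own clone of the `Arc` and calls `flag.store(true, Ordering::Release)` when a guard trips; the scheduler then stops within one decode step instead of running to `max_tokens`.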
…+ opencode)
Layer-D hotfix + Layer-A primary fix (Layer-B watchdog-period extension
deferred to a follow-up — the codegen plumbing for new MODEL.toml
[behavior] keys spans 6 files and Layer-A on its own closes the leak
end-to-end).
**The bug.** With opencode (9 tool schemas, temp=0.3, 16k+ system prompt)
Qwen3.6-35B-A3B emits literal `<tool_call><function=…><parameter=…>…
</parameter></function></tool_call>` XML INSIDE its `<think>` block,
repeated 6+ times before the thinking budget force-closes `</think>`.
The leak reaches opencode's rendered transcript as `reasoning_content`
deltas and visibly contaminates the agentic session. Three-agent
investigation pinpointed six root causes; this patch closes the two
biggest gaps.
**Layer D — `kernels/gb10/qwen3.6-35b-a3b/MODEL.toml`** (hotfix):
`thinking_in_tools = true → false`. With thinking off during tool-active
turns, the `<think>` block never opens and the leak cannot occur.
Re-enable once Layer A is validated end-to-end via opencode.
**Layer A — stream-side scanner with hard cancel** (the real fix):
- `crates/spark-server/src/api/chat_stream/state.rs`: two new fields on
StreamState — `reasoning_xml_scan_buf: String` (≤256-char rolling
tail of decoded reasoning text across deltas) and
`reasoning_xml_leak_detected: bool` (one-shot).
- `crates/spark-server/src/api/chat_stream/handle_token.rs` (thinking
branch only): after the existing per-delta strippers, scan the
cross-delta buffer for `<tool_call>` / `<function=` / `<parameter=` /
`<invoke ` openers (covers Qwen3-Coder, Hermes, MiniMax-XML
variants). On first match:
1. Drop the current delta (no reasoning_chunk emit).
2. Set `state.tool_loop_capped = true` — PR #87's override in
`handle_done.rs:151` then maps `finish_reason` to `"length"`,
which agent clients (opencode, Claude Code, …) treat as
"forcibly truncated" and break their outer retry loop.
3. Set `state.stop_string_triggered = true` for the suppression
path.
4. Set `state.cancel_flag` (PR #89) — the scheduler reads it at
the top of `emit_step::emit_token` and finalises the sequence
within one decode step.
5. `tracing::warn!` with model + request id + last 64 chars of
the scan buffer for postmortem.
Subsequent thinking-phase tokens for the stream short-circuit at the
top of the branch with empty SSE output (handles the in-flight race
between cancel set and emit_token noticing).
Scanner is gated on `ctx.tool_defs_for_backfill` being non-empty
(i.e. tools_active) so a benign chat that genuinely discusses the
`<tool_call>` literal does not trip it.
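The cross-delta scan can be illustrated with a small rolling-tail buffer. This is a sketch under assumptions, not the `handle_token.rs` code: `ReasoningScan` and `push_delta` are hypothetical names, and the tools-active gate, the cancel flag, and the warn log are left out.

```rust
/// Size of the rolling tail kept across deltas, in characters.
const SCAN_TAIL_CHARS: usize = 256;
/// Opener strings listed in the commit message.
const OPENERS: [&str; 4] = ["<tool_call>", "<function=", "<parameter=", "<invoke "];

/// Hypothetical scanner state (stand-in for the two new StreamState fields).
#[derive(Default)]
pub struct ReasoningScan {
    buf: String,
    pub leak_detected: bool,
}

impl ReasoningScan {
    /// Append one reasoning delta. Returns `true` if the delta must be
    /// dropped because a tool-call opener was seen (now or earlier).
    pub fn push_delta(&mut self, delta: &str) -> bool {
        if self.leak_detected {
            return true; // one-shot: every later delta is dropped as well
        }
        self.buf.push_str(delta);
        if OPENERS.iter().any(|opener| self.buf.contains(*opener)) {
            self.leak_detected = true;
            return true;
        }
        // Keep only the last SCAN_TAIL_CHARS characters, cutting on a
        // character boundary so multi-byte text is never split.
        let n = self.buf.chars().count();
        if n > SCAN_TAIL_CHARS {
            let cut = self.buf.char_indices().nth(n - SCAN_TAIL_CHARS).map(|(i, _)| i);
            if let Some(cut) = cut {
                self.buf.replace_range(..cut, "");
            }
        }
        false
    }
}
```

Scanning the buffer rather than the single delta is what catches an opener split across two deltas, such as `<tool` followed by `_call>`.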
**Verified.** `cargo check`, `cargo clippy --tests`, `cargo fmt --check`,
`cargo test -p spark-server` (484 passed) all clean. `handle_token.rs`
476 LoC — under the 500 cap. Live replay verification follows image
rebuild against the running dgx1 instance.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Checkpoint of work-in-progress before the Phase A injection-removal
refactor (see bench/qwen36_fp8_dequant_audit/prompt_injection_audit.md
and post_processor_comparison_atlas_vs_vllm.md).
Phase 2 numerical-drift fixes (Qwen3.6-35B-A3B-FP8):
- RNE rounding in f32_to_bf16 (atlas-quant/src/fp8.rs +
spark-model/src/weight_map/fp8_lut.rs)
- __expf softmax in prefill_paged_compute.cuh (polynomial path
retained behind ATLAS_FAST_SOFTMAX_EXP)
- FP16 P x V MMA (10-bit mantissa vs BF16's 7-bit) in
prefill_paged_compute*.cuh
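For readers unfamiliar with the RNE change, the sketch below contrasts plain truncation with round-to-nearest-even when narrowing f32 to BF16. It is a generic illustration of the technique; the function names and the NaN handling are assumptions for this example and are not taken from `atlas-quant` or `spark-model`.

```rust
/// Truncation: keep the top 16 bits and discard the rest.
pub fn f32_to_bf16_truncate(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16
}

/// Round-to-nearest-even: add 0x7FFF plus the lowest kept bit, then
/// truncate. Exact ties round toward the value whose last bit is 0.
pub fn f32_to_bf16_rne(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep the sign and force a quiet-NaN mantissa bit so rounding
        // can never turn a NaN into infinity (assumed policy).
        return ((bits >> 16) as u16) | 0x0040;
    }
    let lsb = (bits >> 16) & 1;
    let rounded = bits.wrapping_add(0x7FFF + lsb);
    (rounded >> 16) as u16
}
```

Truncation always rounds toward zero, so its error is biased in one direction; RNE spreads the error evenly, which is the property the numerical-drift fix relies on.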
Watchdog rewrites (scheduler/{helpers,decode_logits_*}.rs):
- vLLM-anchored content-loop detector (has_repeating_pattern_anchored)
- disable_watchdogs() env-var kill switch
- mid-word </think> defer via mid_word_token_mask
- describe_content_token_loop diagnostic re-scan on fire
Build infra (atlas-kernels/build.rs + build_target.rs):
- Dedup + std::thread::scope parallel NVCC compile; 7min -> 75sec
(242 unique invocations vs 1169 raw, 4.8x dedup, 20 workers)
MODEL.toml sampler band-aid (all 11 GB10 model dirs):
- Empirical defaults (temp=0.6, pres=0.0, rep_pen=1.1, dry=0.5) that
produced clean opencode sessions. Documented Qwen-card "correction"
regression to anchor future tuning.
- max_inter_tool_prose raised 384 -> 2048 (was killing legitimate
multi-paragraph explanations between tool calls).
Audit artifacts (bench/qwen36_fp8_dequant_audit/):
- prompt_injection_audit.md cataloging 13 always-on prompt mutations
- post_processor_comparison_atlas_vs_vllm.md identifying 5 vLLM wins
- per_model_sampler_recommendations.md research baseline
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Atlas previously ran a stateful agentic-failure-handling layer that
mutated the user-supplied message array in 13 distinct ways before
tokenization — none opt-in per request, all keyed on heuristics over
message history (audit in bench/qwen36_fp8_dequant_audit/
prompt_injection_audit.md). Phase 2 precision fixes (RNE, __expf,
FP16 P×V, vLLM-anchored loop detection) make these mutations obsolete;
worse, they now condition the model to react to XML tags it wasn't
trained on.
Deleted injection sites:
1. Tool-parser system_prompt prepend (api/chat/mod.rs:106-118) —
jinja template's `tools` argument is the legitimate scaffolding
2-9. F-feature failure guards (chat_phases::apply_failure_guards
and the entire api/failures/ subtree): F7 stall reminder, F23
progress reminder, F29 environment_facts, F31 hard-refusal,
F32 dup tool_result, F35 failure_recovery, F39 circuit banner,
F49 dup-write banner, F50 original-error append
10. Loop-detector <IMPORTANT> hint append (loop_detect.rs:131-146)
11. task_pin verbatim-goal reminder ("The user's ORIGINAL request
was: «...»"): task_pin.rs deleted, along with its caller in
loop_detect.rs
12. observation_mask body rewrite (rewrote past tool/user error
bodies in place with "[stale tool failure N/M: ...]"):
observation_mask.rs deleted, along with its caller in chat/mod.rs
13. Responses-API instructions stacking (responses_lowering.rs):
simplified to spec-faithful insert at pos 0 (no drop-of-prior)
14. /v1/completions `<think></think>\n\n` raw-prompt prefix
(completions.rs:74-78)
Preserved legitimate streaming guards (NOT prompt injections; operate
purely on outbound model text) by extracting them from the deleted
failures subtree into a new api/stream_guards.rs module:
- bump_f12_tool_call_count: per-response tool-call cap
- check_loop_watchdog: post-detector repeating-line detector
- flush_content_sanitizer: stream-end tag-tail drain
Telemetry counters removed alongside their injectors:
- atlas_task_pin_injections_total
- atlas_observation_mask_elided_bodies_total
Net: -3480 LoC (40 files), zero injection sites remaining (verified
zero matches for atlas_runtime_notice / failure_recovery /
environment_facts / IMPORTANT> / stale tool failure / ORIGINAL
request was). cargo check + cargo check --tests both green.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
…ctionParams
Two small per-request controls ported from vLLM. Phase B-2
(thinking_token_budget) was already present on ChatCompletionRequest
(top-level + chat_template_kwargs paths), so only B-1 and B-3 needed
work.
B-1: stop-string hold-back buffer (vLLM IncrementalDetokenizer pattern)
- StreamCtx.stop_string_buffer_len cached once at request setup as
(byte length of the longest stop string) - 1
- StreamState.stop_string_emitted_len tracks forwarded bytes
- handle_token::apply_stop_string_holdback() helper:
- On miss: hold back trailing buffer_len bytes (snapped to UTF-8
char boundary via floor_char_boundary)
- On match: truncate BOTH accumulator and emittable delta at
match position (mirrors vLLM detokenizer.py:140-142)
- handle_done flushes any held-back tail bytes through the active
detector/sanitizer at stream close
- 3 unit tests: chunk-boundary leak prevention, no-stop-strings
zero behavior change, UTF-8 boundary safety
- Hand-tuned <think>-tag boundary deferral untouched
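The B-1 hold-back logic can be sketched as below. This is a simplified illustration under assumptions: `StopHoldback` is a hypothetical type that folds the `StreamCtx` and `StreamState` fields into one struct, and the character-boundary snap is written by hand rather than with `floor_char_boundary`.

```rust
/// Hypothetical hold-back buffer for stop-string matching across chunks.
pub struct StopHoldback {
    stops: Vec<String>,
    acc: String,        // all decoded text so far
    emitted_len: usize, // bytes of `acc` already forwarded to the client
    hold_len: usize,    // longest stop string length minus one
    stopped: bool,
}

impl StopHoldback {
    pub fn new(stops: &[&str]) -> Self {
        let longest = stops.iter().map(|s| s.len()).max().unwrap_or(0);
        Self {
            stops: stops.iter().filter(|s| !s.is_empty()).map(|s| s.to_string()).collect(),
            acc: String::new(),
            emitted_len: 0,
            hold_len: longest.saturating_sub(1),
            stopped: false,
        }
    }

    /// Feed one decoded delta; returns the text that is safe to forward.
    pub fn push(&mut self, delta: &str) -> String {
        if self.stopped {
            return String::new();
        }
        self.acc.push_str(delta);
        let pending = &self.acc[self.emitted_len..];
        let hit = self.stops.iter().filter_map(|s| pending.find(s.as_str())).min();
        let end = match hit {
            Some(rel) => {
                // Match: truncate the accumulator at the match position.
                self.stopped = true;
                let cut = self.emitted_len + rel;
                self.acc.truncate(cut);
                cut
            }
            None => {
                // Miss: hold back the tail, snapped down to a char boundary.
                let mut end = self.acc.len().saturating_sub(self.hold_len);
                while !self.acc.is_char_boundary(end) {
                    end -= 1;
                }
                end.max(self.emitted_len)
            }
        };
        let out = self.acc[self.emitted_len..end].to_string();
        self.emitted_len = end;
        out
    }

    /// At stream close, release whatever is still held back.
    pub fn flush(&mut self) -> String {
        if self.stopped {
            return String::new();
        }
        let out = self.acc[self.emitted_len..].to_string();
        self.emitted_len = self.acc.len();
        out
    }
}
```

Holding back one byte fewer than the longest stop string is enough because a stop string that has not yet matched can have at most that many of its bytes already in the buffer.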
B-3: per-request RepetitionDetectionParams
- New struct {min_pattern_size, max_pattern_size, min_count} on
ChatCompletionRequest + CompletionRequest + InferenceRequest
- Threaded into ActiveSeq, PrefillInProgress, SwappedSeq (survives
swap/restore)
- helpers.rs: new detect_*_token_loop_with(tokens, Option<override>)
variants; legacy detect_*_token_loop() delegates to _with(.., None)
for byte-identical fallback to watchdog_params() constants
- decode_logits_{step,content}.rs pass a.repetition_detection to the
detectors
- Fuzzy + digit-normalized SimHash detectors left untouched
- 3 unit tests for override path (loosen / tighten / thinking-loop)
Test cleanup: detects_fence_body_with_varying_prefixes — was written
for the OLD scan-anywhere detector and broke when we switched to
vLLM-anchored (commit 1bb82ed). Inverted assertion + renamed to
rejects_*: anchored detector intentionally does not fire on this
varying-prefix pattern; the rollback-to-boundary machinery catches
the resulting tighter loop one layer up.
cargo check + cargo check --tests + clippy all green.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Live opencode session (opencode-phaseAB.jsonl, 2026-05-24) on the
post-PhaseB image surfaced two distinct user-visible failures:
1. **Content-loop watchdog firing on legitimate JSON tool-call body.**
`CONTENT_LOOP_MIN_REPEATS=2` + `PERIOD_MIN=2` meant the watchdog
fired on FOUR matching end-anchored tokens. JSON structural
punctuation (`","`/`":"`) naturally forms period-2 patterns
inside tool-call bodies — observed 18:13:18 firing at
content_tokens=48 inside a bash() invocation, ending the response
with `reason=NoBoundary` (rollback declined because mid-grammar).
The tool-salvage path recovered a garbled `bash({"command":""})`.
Fix: `CONTENT_LOOP_MIN_REPEATS` 2 → 3. Matches vLLM's
`RepetitionDetectionParams.min_count` default. Period-2 with 3
repeats now requires 6 matching end-anchored tokens — still
catches genuine `[A, B]` attractors (~100ms after onset), but
tolerates the `","`/`":"` JSON-punctuation band.
Per-request `repetition_detection` (Phase B-3) still wins over
this default, so clients with stricter or looser needs can
override.
2. **2048-token thinking budget burned on agentic turns.**
Two failures traced to `max_thinking_budget=2048` (qwen3.6 MODEL.toml):
- seq=1 (title-gen, 0 tools): 2068 thinking tokens for a 7-token
title.
- seq=8 (agentic turn 9): full 2048 budget exhausted, model
force-closed via `</think>` injection, then emitted EOS with
zero content tokens. 33s of stall.
Fix: `qwen3.6-35b-a3b/MODEL.toml.max_thinking_budget` 2048 → 768.
Bounds worst-case thinking burn at ~30s. Typical agentic
thinking is 200-500 tokens; 768 leaves headroom. Clients can
pass `thinking_token_budget` per-request (Phase B-2) for chat
workflows that genuinely need more.
Tests updated: `content_loop_accepts_two_repeats` → `accepts_three_repeats`
+ added negative `rejects_two_repeats`; `override_loosens_content_loop_threshold`
updated for new default (3-repeat fires, 4-repeat doesn't).
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
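The threshold arithmetic in the watchdog fix above (period 2 × 2 repeats = 4 tokens before, period 2 × 3 repeats = 6 tokens after) can be checked with a small end-anchored detector. This is a sketch of the general technique; `has_end_anchored_repeat` is a hypothetical function and is not the signature of Atlas's `has_repeating_pattern_anchored`.

```rust
/// True if the last `period * min_count` tokens are one `period`-long
/// pattern repeated `min_count` times, anchored at the end of `tokens`.
pub fn has_end_anchored_repeat(tokens: &[u32], period: usize, min_count: usize) -> bool {
    if period == 0 || min_count < 2 {
        return false;
    }
    let need = period * min_count;
    if tokens.len() < need {
        return false;
    }
    let tail = &tokens[tokens.len() - need..];
    // Every token must equal the token one period later.
    tail.iter().zip(tail.iter().skip(period)).all(|(a, b)| a == b)
}
```

With `min_count = 2`, a short run such as `[",", ":", ",", ":"]` already satisfies the check; with `min_count = 3`, the same pattern must persist for six tokens.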
Live opencode session on `atlas-gb10:hotfix` (opencode-hotfix.jsonl
seq=10, 2026-05-24): after Atlas's tool-validation correctly rejected
a `write({})` call with empty filePath, the model degenerated into a
doom loop emitting partial `<tool_call>` envelopes that never
closed. The streaming sanitizer correctly detected the orphan opener
and flipped `suppressing_param_leak=true`, then drained the
subsequent content waiting for a close — but no close arrived for
8192 tokens (max_tokens limit). The content-loop watchdog couldn't
catch this because partial-envelope periods exceed
`CONTENT_LOOP_PERIOD_MAX=64`. User-visible result: 116s of "hung"
streaming with finish=length, content="" (sanitizer suppressed all
of it).
Fix: add `suppress_streak_tokens: u32` to `StreamState`. Per-token
on the chat_stream path, increment when `suppressing_param_leak` is
true at end-of-token; reset when it flips false. When the streak
exceeds `MAX_SUPPRESS_STREAK_TOKENS=256`, trip the same kill switch
the content-loop watchdog uses (`loop_watchdog_triggered=true`,
`stop_string_triggered=true`, `cancel_flag.store(true)`). 256 tokens
is enough headroom for legitimately long tool-call bodies (large
`content` strings on a `write` call) while bounding worst-case
wasted decode at ~10s @ 30 tok/s.
The 8192-token doom loop now ends in ~10s with finish_reason=length
and a tracing warn line carrying the streak count, so future
occurrences are self-debuggable.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Scaffold the `LogitsProcessor` trait + 8 processor implementations
extracted from the inline pre-sample masking block in
`decode_logits_seq::process_seq_logits`. **Not yet wired** — this
commit only introduces the new module; the inline block in
`process_seq_logits` still runs the same logic. Part 2 will replace
the inline block with `run_pipeline(...)`.
Module layout (`crates/spark-server/src/scheduler/logit_processors/`):
- `mod.rs`: `LogitsContext`, `ProcessorOutcome` enum, `LogitsProcessor`
trait with `apply(&mut [f32], &mut ActiveSeq, &LogitsContext) ->
ProcessorOutcome`, `name()`, `is_argmax_invariant()`, and the
`run_pipeline(...)` driver wiring the eight stages in this order:
1. F2ConfidenceEarlyStop
2. MidWordThinkEndMask
3. PostCloseThinkMask
4. ToolCallDuringThinkingMask
5. ForcedThinkEndInjector
6. PinToToolCallStart
7. ForcedTokenFastPath ← may short-circuit (EmitToken)
8. GrammarBitmaskApply
- One file per processor (~40-80 LoC each) preserving byte-identical
gate semantics from the monolith:
f2_confidence.rs / mid_word.rs / post_close.rs /
tool_during_think.rs / forced_think_end.rs / pin_tool_call.rs /
forced_token.rs / grammar_bitmask.rs
- `pipeline_tests.rs` (cfg=test): stage-name stability + distinctness,
`is_argmax_invariant` advertisement matrix, `should_inject_think_end`
truth table, defer-override constants pin. Integration replay
against the wired pipeline will live in opencode-session.md once
part 2 lands (ActiveSeq is too complex to fixture cleanly).
Out of scope (intentional): AdaptiveSamplingState entropy observation
runs after this pipeline — it's a sampling-policy decision, not a
logit transform. The final `sample_with_params_history` call is also
downstream. Both stay in `process_seq_logits` after the wire.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
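The shape of the trait and driver described in this commit can be sketched as follows. This is a deliberately reduced illustration: the real `apply` also takes `&mut ActiveSeq` and `&LogitsContext`, which are omitted here, and `MaskTokens`, `ForcedToken`, and `argmax` are hypothetical stand-ins for the eight real stages.

```rust
#[derive(Debug, PartialEq)]
pub enum ProcessorOutcome {
    /// Keep running later stages, then sample from the logits.
    Continue,
    /// Short-circuit: emit this token without sampling.
    EmitToken(u32),
}

pub trait LogitsProcessor {
    fn name(&self) -> &'static str;
    fn apply(&self, logits: &mut [f32]) -> ProcessorOutcome;
}

/// Stand-in for a mask stage: forbids a fixed set of token ids.
pub struct MaskTokens(pub Vec<usize>);
impl LogitsProcessor for MaskTokens {
    fn name(&self) -> &'static str { "mask_tokens" }
    fn apply(&self, logits: &mut [f32]) -> ProcessorOutcome {
        for &id in &self.0 {
            if let Some(l) = logits.get_mut(id) {
                *l = f32::NEG_INFINITY;
            }
        }
        ProcessorOutcome::Continue
    }
}

/// Stand-in for a forced-token fast path.
pub struct ForcedToken(pub Option<u32>);
impl LogitsProcessor for ForcedToken {
    fn name(&self) -> &'static str { "forced_token" }
    fn apply(&self, _logits: &mut [f32]) -> ProcessorOutcome {
        match self.0 {
            Some(tok) => ProcessorOutcome::EmitToken(tok),
            None => ProcessorOutcome::Continue,
        }
    }
}

/// Run stages in order; stop at the first stage that forces a token.
pub fn run_pipeline(stages: &[Box<dyn LogitsProcessor>], logits: &mut [f32]) -> ProcessorOutcome {
    for stage in stages {
        if let ProcessorOutcome::EmitToken(tok) = stage.apply(logits) {
            return ProcessorOutcome::EmitToken(tok);
        }
    }
    ProcessorOutcome::Continue
}

/// Index of the largest logit (first one wins on ties).
pub fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}
```

Keeping each stage behind one trait is what later lets the same ordered list run on both the non-MTP and the MTP verify paths.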
… fall-through
Live opencode-hotfix2.jsonl seq=4 (2026-05-24 19:00→19:02): the hotfix-2 streak watchdog DID NOT fire on an 8215-token doom loop (content="", finish=tool_calls, max_tokens=8192 effectively burned).
Root cause: `handle_token` has 8+ early-return branches; the streak check was at the natural fall-through, so most token paths skipped it entirely.
Fix: split `handle_token` into a thin wrapper + `handle_token_inner` body. The wrapper runs the streak update + kill check AFTER the inner body returns, so every token path increments/resets the counter and trips the kill switch when warranted. Body now has the same early-return shape; only the watchdog block moved.
The 256-token streak bound was correct; just wasn't reachable.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
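The wrapper split can be illustrated with a toy token handler. This is a sketch, not the real `handle_token`: the state struct, the four early-return branches, and the `max_streak` parameter are assumptions chosen to keep the example short.

```rust
/// Hypothetical, reduced stream state.
#[derive(Default)]
pub struct StreamState {
    pub suppressing: bool,
    pub streak: u32,
    pub killed: bool,
}

/// Inner body with several early returns, as in the original handler.
fn handle_token_inner(state: &mut StreamState, text: &str) -> Option<String> {
    if text.is_empty() {
        return None;
    }
    if text == "<tool_call>" {
        state.suppressing = true;
        return None;
    }
    if text == "</tool_call>" {
        state.suppressing = false;
        return None;
    }
    if state.suppressing {
        return None;
    }
    Some(text.to_string())
}

/// Thin wrapper: the streak update runs after the inner body returns,
/// so no early return can skip it.
pub fn handle_token(state: &mut StreamState, text: &str, max_streak: u32) -> Option<String> {
    let out = handle_token_inner(state, text);
    if state.suppressing {
        state.streak += 1;
    } else {
        state.streak = 0;
    }
    if state.streak > max_streak {
        state.killed = true; // stand-in for the real kill switch
    }
    out
}
```

Had the streak update been placed at the end of `handle_token_inner`, every suppressed token would have returned before reaching it, which is the bug this commit describes.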
ROOT CAUSE found 2026-05-24 (opencode-hotfix2b.jsonl seq=13, 8222 tokens, finish=length, content="", reasoning=29):
`process_decode_logits` is the non-MTP decode path. It calls `handle_content_token` which contains the content-loop watchdog (period-2…64 detector, MIN_REPEATS=3). All the prior watchdog tuning applies HERE.
`verify_k2_step` / `verify_k3_step` / `mtp_step` / `spec_step` are the MTP / speculative-decode paths. They call `emit_token` directly, which only bookkeeps the emitted token (push to output_tokens, advance grammar, EOS check); it does NOT run any watchdog.
When `--speculative` is on (which is on by default for this build since `--speculative --mtp-quantization bf16` are in the docker command), essentially every token goes through the MTP path. The content-loop watchdog never fires on MTP-emitted tokens. The recently-tuned MIN_REPEATS=3, the per-request RepetitionDetectionParams plumbing, and the rollback machinery are all unreachable for the common case.
Observed live: model degenerated into a period-4 `[29, 198, 510, 15704, …]` attractor (the `parameter>\n` doom-loop from the project's prior memory note). 8193 content tokens of pure repeat, zero watchdog fires, response burned to max_tokens=8192.
Fix: mirror the content-loop watchdog check inside `emit_token` for the `!inside_thinking` branch. Increments `a.content_tokens`, re-uses the same detector functions and gate constants (`disable_watchdogs`, `enable_loop_watchdog`, `CONTENT_LOOP_*`, `detect_content_token_loop_with`, normalized variant). On fire: set `a.finished = true` and warn.
Skip rollback in this mirror: `emit_token` doesn't take `&dyn Model` (the SSM rewind requires it) and plumbing it through the 14+ MTP/spec call sites would be a separate diff. End-of-stream is a strict improvement over the current 8192-token burn; rollback can be added later if needed. The non-MTP path retains rollback via `handle_content_token` unchanged.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
…aths
Closes the root-cause loop the team-debug investigation identified
(see bench/hotfix3-debug/SYNTHESIS.md): MTP K=2/K=3/K=4/dflash verify
paths were returning raw GPU argmax with NO pre-sample logit
processors applied. Every MTP-emitted token escaped grammar_bitmask,
mid_word_mask, post_close_mask, tool_during_think_mask,
forced_think_end_injector, pin_to_tool_call, forced_token_fastpath,
and f2_confidence_early_stop — causing grammar desync, malformed
tool calls (empty filePath, invented 'description' tool names),
mid-word </think> cuts, stray <think> re-entry, and the
prompt_tok>13k cliff documented in opencode-hotfix3.jsonl.
## Non-MTP path (P1a)
`decode_logits_seq::process_seq_logits`: 435 → 185 LoC. Replaced the
inline 200-line 8-stage masking block with a single
`run_pipeline(&mut f32_logits, a, &ctx)` call. `LogitsContext` carries
think_start/end and tool_call_start/end tokens; pipeline driver
short-circuits on ForcedTokenFastPath::EmitToken. Dequant block,
adaptive-sampling/sample_with_params_history, and
extract_logprobs_from_f32 unchanged.
Function-signature changes: `_think_start_token` and
`_tool_call_end_token` lost the underscore prefix — pipeline
processors consume them. Sole caller (`decode_logits_step.rs:82`)
already passes by name; no other call-site changes.
## MTP verify paths (P1b)
New helper: `crates/spark-server/src/scheduler/verify_pipeline_helper.rs`
(148 LoC):
- `verify_pick_with_pipeline(logits_bytes, ...)`: dequant + run_pipeline
+ argmax for ONE verify position (~0.4ms/position @ 256k vocab)
- `verify_pick_all_with_pipeline(model, argmax_ids, ...)`: copy
[K, vocab] BF16 verify logits D2H, process all K positions in
sequence; falls back to raw argmax on D2H failure
Wired into:
- `verify_k2_step` (k=2 mainline)
- `verify_k3_step` (k=3)
- `verify_k4_step` (k=4)
- `verify_dflash_step` (dflash spec-decode)
- `spec_step::step_self_spec` (self-speculation)
- `spec_step::step_ngram_verify` (ngram k=2)
- `mtp_step` (Phase A bootstrap + Phase B verify dispatch)
LogitsContext is built once per scheduler tick in `scheduler/mod.rs`
and threaded through `step_mtp`/`step_self_spec`/`step_ngram`.
Drafts are left alone — they already go through grammar bitmask
inside `run_mtp_propose_multi` per `mtp_head/forward.rs:382-463`.
GPU verify still bakes argmax inside the CUDA graph; the new path
adds a one-shot D2H copy (~K × 0.4ms) which is NOT graph-captured
(constraint from the helper design). Reuses
`model.copy_logits_to_host()` infrastructure rather than touching
`verify_b.rs` / `verify_c.rs` — additive, old argmax path remains for
non-pipeline call sites if needed.
## P2: restore `!a.inside_tool_body` gate on content-loop watchdog
Both `handle_content_token` (decode_logits_content.rs) and the
hotfix-3 MTP mirror in `emit_step.rs` now skip the watchdog inside
the tool body. xgrammar guarantees structural termination there;
repeated rejected drafts that look like a period-N attractor were
producing false-positive amputations of valid JSON. The
`parameter>\n`-style real loops are still caught the instant the
model exits the tool body.
## P3: non-silent gs.accept_token failure
`emit_step.rs::emit_token` now captures the `gs.accept_token(tok)`
return value. On `false` (token violates grammar; xgrammar NPDA
refused), warn + `a.finished = true`. Previously the failure was
ignored, and the silent desync corrupted all downstream bitmasks for
the rest of the response. After P1a+P1b this should rarely fire; it
is defense-in-depth.
## P4: K2/K3/K4 ACCEPT/REJECT log fix
Replaced `seq_len.is_multiple_of(50)` gates (which logged every
REJECT but only every 50th ACCEPT, misleading the team-debug
investigation into thinking accept rate was 0.32% when actual was
41.5%). Now per-100-step summary `info!` with
accepts/rejects/rate via `AtomicU64` counters; per-step events
demoted to `debug!`.
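A minimal sketch of the counter-based summary, assuming a hypothetical `VerifyStats` type. The real code logs through `info!`; here the summary line is returned as a `String` so the example needs nothing beyond the standard library.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical accept/reject counters shared by the verify paths.
#[derive(Default)]
pub struct VerifyStats {
    accepts: AtomicU64,
    rejects: AtomicU64,
}

impl VerifyStats {
    /// Record one verify step. Returns a summary line every `every`
    /// steps, so accepts and rejects are reported at the same rate.
    pub fn record(&self, accepted: bool, every: u64) -> Option<String> {
        if accepted {
            self.accepts.fetch_add(1, Ordering::Relaxed);
        } else {
            self.rejects.fetch_add(1, Ordering::Relaxed);
        }
        let a = self.accepts.load(Ordering::Relaxed);
        let r = self.rejects.load(Ordering::Relaxed);
        let total = a + r;
        if every > 0 && total % every == 0 {
            let rate = 100.0 * a as f64 / total as f64;
            Some(format!("accepts={a} rejects={r} rate={rate:.1}%"))
        } else {
            None
        }
    }
}
```

Because both outcomes feed the same summary, the reported rate no longer depends on which event type happens to be logged more often.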
## P5: flush `reasoning_tag_scan_buf` on `</think>`
`handle_token.rs` previously dropped up to ~18 trailing bytes of
every thinking block on `</think>` fire — the sanitizer's tail
buffer was never flushed. Added an explicit flush block (skipped
when reasoning-side suppression is active).
## Out of scope
Model-side FP8 KV precision drift at long context ("axut"/"withcurl"
spelling errors documented in findings-agent-C-tokenizer.md) is a
separate Phase 2c concern. The pipeline fix above does NOT address
those — they're emitted by the model, decoded correctly, rendered
correctly. See project_qwen36_phase2b_softmax_expf.md.
## Verification
- `cargo check --tests -p spark-server`: clean (0 errors, 0 warnings)
- `cargo clippy --tests -p spark-server`: clean (0 lints)
- `cargo check --workspace --tests`: clean
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Live-test on `atlas-gb10:realfix` showed every response ending at 30-60
content tokens with `gs.accept_token returned false — xgrammar NPDA
refused the emitted token` (P3 kill-switch) firing on tok=198 (\n) or
tok=29 (structural punctuation). Root cause: in
`verify_pick_all_with_pipeline`, the pipeline was applied to each of
the K verify positions independently, but the xgrammar matcher state
was NEVER ADVANCED between positions — position 1's bitmask was
computed against position 0's matcher state.
For K=2 spec decode:
1. Position 0: bitmask correct (matches current matcher state)
2. Position 1: bitmask STALE — matcher hasn't seen pick[0] yet
3. Pipeline argmax for position 1 picks a token valid at position 0's
state but possibly invalid at position 1's state
4. emit_token(drafts[0]) → accept_token advances matcher → state now
at position 1
5. emit_token(verified[1]) → accept_token(pick[1]) — pick[1] is not
in this state's bitmask → returns false → P3 ends response with
finish_reason=length
Example: token 198 (`\n`) is legal at JSON-value-start (post-`:` in
`"key": <value>`) but illegal at JSON-comma-or-closebrace (post-value
in `"key": "value"`, where the matcher expects `,` or `}`). The bonus
token sampled at position 0's state landed on `\n`, but the real
position-1 matcher state demanded `,` or `}`.
Fix: between each verify position, speculatively `gs.accept_token(pick)`
to advance the matcher, then `gs.rollback(grammar_advances)` at end of
the helper. The real `emit_token` calls (which run after this helper
returns) re-advance the matcher normally based on which tokens the
scheduler actually accepts/rejects, so the speculative advance must be
reverted to avoid double-advancement.
Defensive check on speculative `accept_token` failure: if the pipeline
ever returns a token outside the current bitmask (would indicate a
real bug elsewhere — pipeline correctly applies the mask so this
shouldn't happen except after grammar termination from a forced-token
fastpath), log a `debug!` and stop speculating. The real emit_token
will still fire P3 if there's a genuine desync — better visibility,
no silent corruption.
Updated module docstring with the new contract + a "DO NOT REMOVE"
note pinning the fix against future "simplification" reverts.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
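The advance-then-rollback fix can be demonstrated with a toy grammar matcher. Everything here is hypothetical: `ToyMatcher` is not xgrammar, its two-token alternating grammar exists only to make the stale-state bug visible, and `verify_pick_all` is a simplified stand-in for `verify_pick_all_with_pipeline`.

```rust
/// Toy matcher: token 0 is legal at even depth, token 1 at odd depth.
#[derive(Default)]
pub struct ToyMatcher {
    accepted: Vec<u32>,
}

impl ToyMatcher {
    fn allows(&self, tok: u32) -> bool {
        tok == (self.accepted.len() % 2) as u32
    }
    pub fn accept_token(&mut self, tok: u32) -> bool {
        if self.allows(tok) {
            self.accepted.push(tok);
            true
        } else {
            false
        }
    }
    pub fn rollback(&mut self, n: usize) {
        let keep = self.accepted.len().saturating_sub(n);
        self.accepted.truncate(keep);
    }
    pub fn depth(&self) -> usize {
        self.accepted.len()
    }
    /// Argmax over the tokens the current state allows.
    fn masked_argmax(&self, logits: &[f32]) -> u32 {
        let mut best: Option<(u32, f32)> = None;
        for (i, &v) in logits.iter().enumerate() {
            let tok = i as u32;
            if self.allows(tok) && best.map_or(true, |(_, b)| v > b) {
                best = Some((tok, v));
            }
        }
        best.map_or(0, |(tok, _)| tok)
    }
}

/// Pick one token per verify position, advancing the matcher between
/// positions and undoing those advances before returning.
pub fn verify_pick_all(gs: &mut ToyMatcher, logits_per_pos: &[Vec<f32>]) -> Vec<u32> {
    let mut picks = Vec::with_capacity(logits_per_pos.len());
    let mut advances = 0usize;
    let mut speculating = true;
    for logits in logits_per_pos {
        let pick = gs.masked_argmax(logits);
        picks.push(pick);
        if speculating {
            if gs.accept_token(pick) {
                advances += 1;
            } else {
                speculating = false; // stop advancing; real emit reports it
            }
        }
    }
    // The caller's own emit path re-advances the matcher, so revert here.
    gs.rollback(advances);
    picks
}
```

In this toy, two positions that both prefer token 0 yield picks `[0, 1]`; without the speculative advance, the second position would be masked against the first position's state and pick token 0 again, which the matcher then rejects at emit time.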
…nsion eliminated
Ran 18 KV cache configurations against the standard 18920-token probe
on Qwen3.6-35B-A3B-FP8, comparing per-layer hidden state cosine vs HF
BF16 reference. Configs cover all 6 supported KV dtypes (Fp8, Bf16,
Nvfp4, Turbo4, Turbo3, Turbo8) × varying --kv-high-precision-layers
× varying --fp8-kv-calibration-tokens.
## Findings (full table in VERDICT-final.md)
1. **8 configs gave bit-identical 0.9605 mean cosine** — bf16-all,
bf16-hp-max, fp8-hp{5,10,max}, nvfp4-hp{5,10,max}. The
chunked-prefill kernel dequants K/V back to BF16 before attention,
so the prefill cosine bench is insensitive to storage dtype. KV
cache choice does NOT move the precision needle for this probe.
2. **3 calibration configs gave bit-identical 0.9175** —
--fp8-kv-calibration-tokens 64, 256, 512 all produce the same
result. Calibration is either no-op for this probe or has a
different code path that uniformly degrades.
3. **Turbo8 broken on Qwen3.6 hybrid arch** — produces all-NaN hidden
states from L19 onward (first SSM-attention-interleave failure
beyond the 4 working full-attention layers). Turbo8 was validated
only on MiniMax M2.7's pure-attention layout, per kv_cache.rs:132.
4. **Per-layer diff vs May 23 rne baseline** — current fp8-baseline
is 0.5-1% worse than the May 23 rne dump at most layers. Confirms
the project_qwen36_phase2b_softmax_expf memory note: __expf
unmasked a deeper precision bug the polynomial was smoothing.
## Conclusion
KV cache is exonerated. The ~4% gap to HF reference lives in
compute, specifically in the post-Phase-2b kernel surface (RNE +
__expf + FP16 P×V). Day 2 starts with bisecting which kernel change
unmasked the regression, then per-kernel cosine bisection within
each layer to localize.
Files preserved:
- bench/phase2c-kv-sweep/runner.sh — config sweep automation
- bench/phase2c-kv-sweep/VERDICT-day1.md — initial draft
- bench/phase2c-kv-sweep/VERDICT-final.md — full 18-config analysis
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Bisect infra (compile + runtime gates for future precision tests):
- ATLAS_EXTRA_NVCC_FLAGS env var threaded into build_target::NvidiaTarget::compile: appends arbitrary nvcc args (typically -D<MACRO>=1) to every kernel compile. Cargo rerun gated via rerun-if-env-changed=ATLAS_EXTRA_NVCC_FLAGS.
- #ifdef ATLAS_DISABLE_FP16_PV in prefill_paged_compute.cuh, both HDIM=256 and HDIM=512 paths. When defined, reverts smem_P/P64 to __nv_bfloat16, P stores to __float2bfloat16_rn, V reads to direct bf16 packing, and the MMA op to .bf16.bf16. Companion bisect knob to the existing ATLAS_FAST_SOFTMAX_EXP gate (which already flips sw_exp polynomial vs __expf).
- ATLAS_DISABLE_RNE=1 runtime env-var in atlas-quant::fp8::f32_to_bf16 and spark-model::weight_map::fp8_lut::f32_to_bf16: reverts the Phase 2b RNE patch to plain truncation (the pre-Phase-2b behavior). No rebuild needed once the gated binary is shipped.
Day 2 bisect results (full table in VERDICT-day2.md):
| Config | mean | min | Δ vs baseline |
|---|---|---|---|
| fp8-baseline (current) | 0.9615 | 0.9179 | — |
| -DATLAS_FAST_SOFTMAX_EXP=1 | 0.9617 | 0.9184 | +0.0002 |
| -DATLAS_DISABLE_FP16_PV=1 | 0.9619 | 0.9172 | +0.0004 |
| ATLAS_DISABLE_RNE=1 | 0.9569 | 0.8990 | -0.0046 |
| rne (May 23 reference) | 0.9668 | 0.9290 | +0.0053 |
None of the three Phase 2b kernel changes individually accounts for the 0.5% cosine gap vs the May 23 rne reference. RNE is actively helping (disabling it regresses by 0.5%). Both __expf and FP16 P×V are effectively neutral.
The rne reference dump's filesystem timestamp predates any docker image currently on disk by 17+ hours, so its lineage is unclear. It is likely a pre-Phase-2 baseline whose 0.5% gap to current realfix2 reflects accumulated infrastructure changes outside the Phase 2b kernel surface. Direct precision comparison is partly invalid.
Day 3 attack vectors: per-sub-step cosine within each layer, MoE expert-routing divergence audit, NVFP4 weight checkpoint comparison.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Days 1+2 falsified the FP8 KV cache hypothesis. Day 3 finds the actual answer: the model degeneration is in the FP8 WEIGHT quantization, not Atlas's compute or KV cache.
Same Atlas (atlas-gb10:realfix2), same probe, swap the weight checkpoint from `Qwen/Qwen3.6-35B-A3B-FP8` to `RedHatAI/Qwen3.6-35B-A3B-NVFP4`:
| Aspect | FP8 | NVFP4 |
|---|---|---|
| Throughput (short) | ~65 tok/s | 117 tok/s (1.8×) |
| TTFT | ~140ms | 72ms (½) |
| TOML quotes | single (invalid) | double (valid) |
| Token glue artifacts | "withcurl"/"axut" | none |
| Language switching | Indonesian descriptions | none |
| Code coherence | garbled | valid Rust+axum |
| Deep-context (9780 prompt) tool call | empty filePath, garbage args | "rm -rf … && mkdir -p …" valid shell |
| Watchdog amputations | frequent | none |
| Long-response finish | doom-loop "length" | natural "stop"/"tool_calls" |
Atlas's compute pipeline is correct. NVFP4's per-block FP8 scales × 4-bit E2M1 nibbles give 16× finer per-element precision than dense FP8 E4M3, holding late-layer K/V and MoE weights within reasonable bounds where FP8 hits its E4M3 dynamic-range cliff and quantizes coarsely. Project memory's note "NVFP4 KV best at deep, FP8 best early" was correct when applied to WEIGHTS, not KV storage.
Verdict: production deployment of Qwen3.6-35B-A3B should use the NVFP4 checkpoint, not FP8. atlas-gb10:realfix2 works directly with RedHatAI/Qwen3.6-35B-A3B-NVFP4; no rebuild needed. Recommended invocation in bench/phase2c-kv-sweep/VERDICT-day3.md.
Also adds: bench/phase2c-kv-sweep/compare-configs.py, the multi-config cosine comparator used throughout the bisects.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Live NVFP4 testing (opencode-nvfp4.jsonl seq=15) hit a doom-loop distinct from the hotfix-2b-handled cases: the model emitted a `<tool_call>` opener but never reached a matching `</tool_call>` close. 8221 tokens of buffered tool-call args, all suppressed by the sanitizer's orphan detection at the final flush, finish=length, content="". User saw the request as "trapped" while it burned to max_tokens=8192.

The hotfix-2b suppression-streak watchdog couldn't catch this because `suppressing_param_leak` only flipped at the FINAL flush (orphan detection ran at stream end, not during emission), so the streak counter never accumulated.

Fix: new ActiveSeq.tool_body_streak_tokens counter incremented in emit_step::emit_token while a.inside_tool_body=true. Resets on `<tool_call>` open or `</tool_call>` close. When the streak exceeds MAX_TOOL_BODY_TOKENS=1024, set a.finished=true.

1024 is enough headroom for legitimate long tool-call bodies (large `content` field on a `write` call) while bounding worst-case wasted decode at ~15s @ 65 tok/s, vs the ~95s burn at max_tokens=8192 we observed.

Companion to:
- hotfix-2b (suppress_streak_tokens): catches orphan opener detected DURING the stream
- content-loop watchdog (decode_logits_content + emit_step): catches period-N attractors OUTSIDE tool body
- max_inter_tool_prose (MODEL.toml): caps free prose BETWEEN tool calls

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
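The streak counter described above can be sketched as follows. This is a toy model, not the Atlas `emit_step::emit_token` code: it matches the opener and closer by text rather than by token id, and `SeqState` and `run` are hypothetical names. Only `MAX_TOOL_BODY_TOKENS = 1024` and the field names come from the commit message.

```python
MAX_TOOL_BODY_TOKENS = 1024  # value quoted in the commit message


class SeqState:
    """Hypothetical stand-in for the per-sequence fields on ActiveSeq."""

    def __init__(self) -> None:
        self.inside_tool_body = False
        self.tool_body_streak_tokens = 0
        self.finished = False


def emit_token(seq: SeqState, text: str) -> None:
    """Update the streak counter for one emitted token."""
    if text == "<tool_call>":
        seq.inside_tool_body = True
        seq.tool_body_streak_tokens = 0
    elif text == "</tool_call>":
        seq.inside_tool_body = False
        seq.tool_body_streak_tokens = 0
    elif seq.inside_tool_body:
        seq.tool_body_streak_tokens += 1
        if seq.tool_body_streak_tokens > MAX_TOOL_BODY_TOKENS:
            seq.finished = True  # treated like an EOS by the caller


def run(tokens):
    """Feed tokens until the sequence finishes; return state and count."""
    seq = SeqState()
    emitted = 0
    for tok in tokens:
        if seq.finished:
            break
        emit_token(seq, tok)
        emitted += 1
    return seq, emitted


# An opener that never closes is cut off after 1025 body tokens.
orphan_seq, orphan_count = run(["<tool_call>"] + ["x"] * 5000)
closed_seq, closed_count = run(["<tool_call>"] + ["x"] * 10 + ["</tool_call>"])
```

The cap fires on the first body token past the limit, so a well-formed call of any length up to 1024 body tokens is unaffected.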
User-prompted forensic audit of "every logic gate that picks the WRONG
quantization for FP8 weights when canonical FP8 paths exist."
Key finding: the initial assumption ("all FP8 weights re-quantized to
NVFP4") was wrong. Atlas's full-attention QKVO and routed MoE experts
DO run on canonical FP8 kernels for Qwen3.6-35B-A3B-FP8 (via
`set_fp8_experts` / `set_fp8_weights` in qwen35 loader). The
`quantize_to_nvfp4` boot-log lines come from a SMALLER set of weights.
5 bugs identified, ranked by severity:
1. HIGH — SSM decode is always NVFP4
File: weight_loader/qwen35/load_layers/linear_attn_arms.rs:176-190
30 of 40 layers (linear attention) decode through NVFP4-from-FP8
weights. The FP8-native arm `build_linear_attention_fp8` exists
in the same file (line 24) but is "permanently short-circuited"
per the comment at load_layers.rs:334-342. The kernels and
set_fp8_weights plumbing for Qwen3SsmLayer all exist.
Likely the actual source of the "deep-layer drift" memory entries
(project_qwen36_phase2b_softmax_expf, project_qwen36_drift_gdn_clean).
Drift accumulates through the SSM stack and corrupts late
full-attention layers L31-L39.
Fix: route variant==Fp8Dequanted to build_linear_attention_fp8;
remove parallel _nvfp4 build. Medium difficulty.
2. MED — Dead NVFP4 shared expert built but never consumed
(ssm_qwen35.rs:184). Memory waste, no correctness impact.
3. MED — MoE router gate loaded via `dense` (raw bytes, no dtype
conversion) then quantize_to_nvfp4'd. If gate is FP8 in
checkpoint, the bytes are read as BF16. Fix: dense → dense_auto.
4. LOW — LM head same pattern. Currently dormant (Qwen3.6 FP8
leaves lm_head BF16) but latent for any future FP8 checkpoint
that quantizes the head.
5. LOW — MTP head double-quantizes via quantize_to_nvfp4 for all
projections under MtpQuantization::Nvfp4. Default should be
Fp8 when native_fp8.
Recommended single fix: Bug #1 (re-enable FP8-native SSM decode).
The required kernels and plumbing exist; only the dispatch needs
re-wiring.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Tried to re-wire LinearAttention to use `build_linear_attention_fp8` for `Nvfp4Variant::Fp8Dequanted`. Build succeeded, container crashed at layer 0 load with `cuMemcpyDtoDAsync_v2 status 1` (INVALID_VALUE).

Root cause (deeper than the CAUSAL-PATHWAY-AUDIT spotted): the `Fp8Weight` struct's `row_scale` field documents itself as "`[N]` f32 per-row dequant scale" (32 KB for N=8192). But `load_fp8_block_scaled_as_fp8weight` populates that same field with per-BLOCK BF16 scales of shape `[N/BS, K/BS]` (= 2 KB for N=8192, K=2048, BS=128 → 64*16*2 bytes). `build_linear_attention_fp8` then concatenates QKV+Z scales via:

    gpu.copy_d2d(qkv_fp8.row_scale, qkvz_scale_ptr, qkv_rows * 4)

which asks CUDA to copy 32 KB from a 2 KB source. CUDA correctly returns INVALID_VALUE.

The downstream `fp8_gemm_n128` kernel that decode_batched calls also doesn't take any scale argument. It expects single-scale FP8 (produced by `bf16_to_fp8` from a BF16 dense), not the block-scaled FP8 the checkpoint ships. So even if the concat were fixed, the math would be wrong (FP8 values un-rescaled by their per-block multipliers).

Reverting the dispatch back to NVFP4-only for LinearAttention. Updated comments at both sites to document why the FP8 arm stays dead-coded and what kernel work would be required to revive it.

The CAUSAL-PATHWAY-AUDIT.md Bug #1 finding stands but is *open*: the right fix is to rewrite the FP8 SSM GEMV/GEMM chain to consume the per-block scales the checkpoint provides, not to flip a dispatch. In the meantime, the production workaround is to serve the NVFP4 checkpoint (`RedHatAI/Qwen3.6-35B-A3B-NVFP4`).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
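The size mismatch behind the INVALID_VALUE can be checked with a few lines of arithmetic. This sketch only reproduces the byte counts quoted above (N=8192, K=2048, BS=128); the helper names are hypothetical.

```python
def per_row_f32_bytes(n: int) -> int:
    """Bytes for an [N] f32 per-row scale vector (what the copy requests)."""
    return n * 4


def block_scale_bf16_bytes(n: int, k: int, bs: int = 128) -> int:
    """Bytes for an [N/BS, K/BS] BF16 block-scale tensor (what is allocated)."""
    return (n // bs) * (k // bs) * 2


N, K, BS = 8192, 2048, 128
requested = per_row_f32_bytes(N)
available = block_scale_bf16_bytes(N, K, BS)
print(f"copy asks for {requested} B from a {available} B buffer "
      f"({requested // available}x over)")
```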
…parser

Multi-iteration work toward clean opencode multi-turn agentic sessions on Qwen3.6-35B-A3B-FP8. The combined image (atlas-gb10:fp8-much-better) ships:

- Native FP8 SSM dispatch (block-scaled `w8a16_gemv` decode + single-scale `fp8_gemm_n128` prefill), no NVFP4 detour. WeightQuantFormat enum tags Fp8Weight buffers with `Fp8BlockScaled`/`Fp8PerRow`/`Fp8SingleScale` so kernel-call sites can assert format and fail-fast on silent quant leaks. Concat math for QKVZ along the N-block axis writes `(qkv_rows/BS)*(K/BS)*2` bytes of BF16 scales per row, then `(z_rows/BS)*(K/BS)*2`; both Nq and Nz are required to align to BS=128 by the on-disk Qwen FP8 format. Out-proj is loaded block-scaled directly.
- Streaming detokenizer fix: content phase now uses the full-decode + slice approach (matching the reasoning path). HF tokenizers' DecodeStream drops the leading metaspace byte at certain BPE-token boundaries for byte-level tokenizers like Qwen's, producing `name = test-rust-axum-v32version` in streamed `<parameter=content>` while non-streaming `tokenizer.decode(&all_toks)` yields the correct multi-line TOML. `tokenizer.decode(&state.all_toks)` + `full[state.emitted..stable_end]` is byte-exact vs non-streaming for both reasoning and content (verified live with temp=0 seed=42).
- MoE gate stays BF16 for native_fp8: the `[num_experts, h]` BF16 gate is in the FP8 release's `ignored_layers`; runtime-quantizing it to NVFP4 (4-bit) destroys the precision the router needs at late layers where the top-8 weights cluster in `[0.105, 0.168]`. Non-FP8 variants keep the runtime NVFP4 quantize.
- Whitespace-only stream chunks are no longer dropped: the filter is now `cleaned.is_empty()` instead of `cleaned.trim().is_empty()`. The `sanitize_content_chunk` holdback rollout commits leading `\n ` indent as a chunk; previously the trim filter dropped that chunk and `state.emitted` advanced past the unemitted bytes, permanently losing indent.
- PR 73's qwen3_xml parser + schema-driven type coercion cherry-picked (qwen3_xml.rs, type_coerce.rs, group_e.rs tests, registration in tool_parser.rs). Live opencode v40 showed qwen3_xml itself confused the model under thinking_in_tools=true + long opencode prompt, so MODEL.toml defaults back to qwen3_coder (auto-detect from model_type), but qwen3_xml stays in the binary as a selectable parser for downstream consumers who want the type-coercion behavior.
- post-think EOS guard scoped to `require_tool_call`. Was unconditional POST_THINK_MIN_CONTENT=16, which forced short answers like "4" or "Paris" to keep generating into chat-template artefacts (`\nuser\nassistant`). Now only suppresses EOS when tools are required, so plain-text answers let `<|im_end|>` fire as expected.
- MODEL.toml `thinking_in_tools = true` (matches user's "MUCH better" recall); `tool_call_parser` default left to auto-detection so qwen3_coder fires under `model_type = qwen3_6_moe`.

All WeightQuantFormat constructors tagged. cargo check / cargo clippy green workspace-wide.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
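The whitespace-only chunk bug is easy to reproduce in miniature. The sketch below is a hypothetical Python analogue of the filter change (`trim().is_empty()` versus `is_empty()`), not the Atlas streaming code; the `stream` helper and the sample chunks are made up for illustration.

```python
def stream(chunks, drop_whitespace_only: bool) -> str:
    """Concatenate chunks, skipping the ones the filter considers empty."""
    out = []
    for chunk in chunks:
        if drop_whitespace_only:
            is_empty = chunk.strip() == ""  # old filter: trim first
        else:
            is_empty = chunk == ""          # new filter: truly empty only
        if is_empty:
            continue  # the cursor has already advanced, so the bytes are lost
        out.append(chunk)
    return "".join(out)


# The holdback commits the newline + indent as a chunk of its own.
chunks = ["[package]", "\n  ", 'name = "demo"']
old_output = stream(chunks, drop_whitespace_only=True)
new_output = stream(chunks, drop_whitespace_only=False)
```

With the old filter the line break and indent vanish, which is the same shape as the collapsed TOML seen in streamed `<parameter=content>` bodies.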
    def fmt(x: float) -> str:
        if x != x:  # NaN

    for h in headers[1:]:
        deltas = [data[h][i] - base[i] for i in range(40)]
        mean_d = float(np.nanmean(deltas))
        max_d = max(deltas, key=lambda x: abs(x) if x == x else 0)
TheTom added a commit to TheTom/atlas that referenced this pull request, May 25, 2026
Applies the FP16 P×V MMA upgrade + __expf softmax replacement from Avarok-Cybersecurity#90 (fix/in-think-tool-call-leak) to the asym prefill kernel that this PR introduced, and pulls the same fix into the upstream symmetric kernels we carried forward unchanged.

Motivation (per @tbraun96's PR Avarok-Cybersecurity#92 review + Discord context): the prior `sw_exp` polynomial advertised ~1e-4 max relative error, but verifies at ~5.6e-3 (~0.5%) against `torch.exp`. Across 18920-token attention rows × 10 full-attention layers, that compounds to measurable per-layer cosine drift vs HF reference. The FP16 P×V MMA upgrade trades ~10% prefill slowdown for ~8× higher mantissa precision on the softmax probabilities (P), which is the dominant remaining attention-output drift source on Qwen3.6-35B-A3B-FP8.

Files:
- kernels/gb10/common/prefill_paged_compute.cuh: direct cherry-pick of the Phase 2c kernel changes + ATLAS_DISABLE_FP16_PV bisect toggle (matches PR Avarok-Cybersecurity#90 byte-for-byte in the fix regions).
- kernels/gb10/common/prefill_paged_compute_512.cuh: same sw_exp refactor for the HDIM=512 path.
- kernels/gb10/common/prefill_paged_compute_asym.cuh: TQ+ asym fork carries the same precision bug; applied the equivalent fix (helper + __half smem_P/P64 + __float2half_rn + .f16.f16 MMA). Skipped the ATLAS_DISABLE_FP16_PV debug toggle for now (can add later if the team wants to bisect asym paths separately).

Q×K stays BF16 (range matters there); P×V becomes FP16 (precision matters, range is bounded [0,1] post-softmax). All bf16 stores now use __float2bfloat16_rn for RNE rounding.

Verified locally: nvcc 13.0 compiles all 114 kernels clean on sm_120 including the TQ+ asym variants (bf16k_turbo3v, fp8k_turbo3v, etc).
`compile_qwen3_coder_tool_grammar` was forcing the tool-call body to
`json_schema` content type, but the qwen3_coder wire format the model
emits — and the parser at `parse_single_b.rs:60-132` expects — is
native XML (`<parameter=KEY>VALUE</parameter>`), not JSON. When the
FSM was in a JSON-string state and the model wanted to emit a single
BPE token that decoded to bytes containing `<parameter=` (or any
non-JSON-string-class character), xgrammar masked the entire token
and forced a near-substitute. The result, live in opencode multi-turn
agentic sessions on 2026-05-25:
/tmp/test-rust-axum-v42 → /tmp/test-rust-axu-v4 (drops `m`, drops `2`)
/tmp/test-rust-axum-v42 → /tmp/test-rust-axu-m-v4 (drops `2`, inserts `-`)
/tmp/test-rust-axum-v42 → /tmp/test-rust-axum-v01 (4→0, 2→1)
"filePath":"]}]}]}}}" (JSON-delimiter cascade)
Switching body to `any_text` keeps the OUTER `<tool_call>\n<function=…>\n
…\n</function>\n</tool_call>` framing constrained by xgrammar (begin/end
fixed) while leaving the body bytes unconstrained, so native XML
`<parameter=>` blocks pass through intact. Schema validation remains:
`validate_single_tool_call` + `backfill_required_params` run host-side
after `parse_one_call` (`tool_handlers.rs:46`), catching any actual
schema violations after the fact. Mirrors MiniMax's grammar pattern at
`compile_minimax_xml_tool_grammar` line 472 which has used `any_text`
for the same native-XML reason.
Live verification (atlas-gb10:fp8-grammar-any-text, opencode axum v43):
- v41 (nvfp4-KV + hp5): 0 files persisted (path corrupted)
- v42 (fp8-KV uniform): 0 files persisted (path corrupted)
- v43 (this commit): 20 files persisted, byte-exact
`/tmp/test-rust-axum-v43/Cargo.toml`,
full cargo new skeleton on disk.
Investigation thread that landed here:
- 40-layer cosine study on dgx2 ruled out model precision drift
(Atlas[FP8-native] vs HF[FP8→BF16] mean cos 0.99497, min 0.99012).
- KV format swap (v41 nvfp4-hp5 vs v42 fp8-uniform) ruled out KV.
- Six parallel Explore agents on dgx1 ruled out:
* HF DecodeStream (audit found no live call sites; 8d2cc87 was clean).
* Sampler penalties in tool body (`inside_tool_body` flips correctly
on token 248058; rep_pen/DRY/freq/presence zeroed in body).
* Stop-string holdback (char-boundary safe, flushed on done).
* Mistral whitespace-skip char-boundary issue (latent, separate bug).
* Sanitizer `state.emitted` cursor desync (real content-phase bug
to fix separately; not the tool-arg path).
- Three web-research agents pinpointed vLLM PR #35615 (Qwen3Coder
spec-decode parameter loss) but reading Atlas's `streaming_impl.rs`
showed Atlas already buffers until `</tool_call>` and parses the
full inner via `parse_one_call`, so the close-before-drain pattern
is structurally absent here. The remaining hypothesis — grammar
bitmask over-constraining valid BPE tokens at the JSON-schema
body — survived; this commit confirms it.
Tests updated (`grammar/tests/qwen3_coder_required.rs`): the previous
suite pinned "grammar REJECTS empty body" — a property of the prior
`json_schema` body type. With `any_text`, required-param enforcement
moves host-side to `validate_single_tool_call`. New tests pin the
envelope shape and accept BOTH native XML and legacy JSON bodies
(the parser supports both via its JSON fallback at parse_single_b.rs:137).
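For readers unfamiliar with the wire format, here is a minimal regex-based sketch of reading a native-XML qwen3_coder body. It is a hypothetical illustration only: it is not the parser in `parse_single_b.rs`, it ignores the JSON fallback, and stripping the newlines around each value is an assumption about how the model pads values.

```python
import re

# <function=NAME> ... </function>, with <parameter=KEY>VALUE</parameter> inside.
FUNC_RE = re.compile(r"<function=([^>\n]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(
    r"<parameter=([A-Za-z_][A-Za-z_0-9]*)>(.*?)</parameter>", re.DOTALL
)


def parse_tool_call(text: str):
    """Return (function_name, {param: value}) or None if no call is found."""
    match = FUNC_RE.search(text)
    if match is None:
        return None
    name, inner = match.group(1), match.group(2)
    # Assumption: values are padded with newlines that are not part of the value.
    args = {key: value.strip("\n") for key, value in PARAM_RE.findall(inner)}
    return name, args


sample = (
    "<tool_call>\n<function=write>\n"
    "<parameter=filePath>\n/tmp/test-rust-axum-v43/Cargo.toml\n</parameter>\n"
    "</function>\n</tool_call>"
)
parsed = parse_tool_call(sample)
```

Because the value is taken verbatim between the tags, nothing in this format requires the bytes to be valid JSON-string content, which is why a JSON-shaped grammar mask over the body was the wrong constraint.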
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors F78 (Write family non-empty path) for shell-execution tools. Without this, the qwen3_coder `any_text` body grammar (commit e99159d) accepts an immediately-closed parameter `<parameter=command></parameter>`, backfill_required_params then inserts `""`, validator passes, opencode returns "The argument 'file' cannot be empty. Received ''", and the model burns its turn budget retrying the same empty call. Live opencode v44 hit exactly this pattern: cargo init succeeded twice clean, then 2× empty bash, then the model "started fresh" on a wrong version number and burned to max_tokens.

The previous `json_schema` body grammar paired with `enforce_min_length_on_required_strings` (`grammar/schema.rs`) enforced min_length 1 at the FSM level. Lifting that check to the validator post-parse keeps the same invariant while letting the grammar body be `any_text` (the native qwen3_coder XML wire format the parser at `tool_parser/parse_single_b.rs:60-132` actually expects).

Verification on v45 (atlas-gb10:fp8-grammar-shellfix):
- cargo init succeeded byte-exact `/tmp/test-rust-axum-v45`.
- read Cargo.toml clean.
- write Cargo.toml with `axum = { version = "= 0.8", features=["json"] }`.
- One content-loop attractor (`version = { version =` period-2 repeat) correctly caught by the existing watchdog; model recovered next turn.
- 0 empty-bash retries: the new validator stops them at the gate.

Out of scope (deferred):
- The Write tool's `content` parameter is still accepted as empty; F78 only carves out PATH_KEYS. Adding `content` would interact with legitimate "create empty file" intents, so deferred until live evidence shows it's a regression risk.
- The MoE FP8 dequant drift identified by the dgx2 per-op cosine study (`bench/fp8_dgx2_drift/MASTER_DRIFT_TABLE.md`, worst op ssm.moe_out @ L20 = 0.91983): multi-day kernel-level investigation.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Empty-required-string validation rejections (F78 path empty, shellfix
command empty) used to emit a `[atlas] Tool call rejected: …` content
chunk and trip `stop_string_triggered`. But `handle_tool_call_start`
had already emitted the `tool_calls[idx]` header to opencode, so
suppressing the args delta left opencode mid-call with no completion.
opencode then reported the cryptic `SchemaError(Missing key)` instead
of its own per-tool error message ("The argument 'file' cannot be
empty. Received ''"), starving the model of actionable feedback.
Soft errors (any validation message containing "non-empty") are
recoverable — emit the args delta as the model produced them, log a
warn, and let opencode's per-tool schema surface its own error.
Hard errors (unknown tool name, args not valid JSON, malformed path
that looks like a directory) still bail with a content chunk because
they cannot be made into a complete tool call at all.
Same treatment applied to `handle_complete_tool_call` for the
non-streaming complete-call path.
Verification (atlas-gb10:fp8-grammar-softrej, opencode v47):
- cargo init succeeded byte-exact `/tmp/test-rust-axum-v47`.
- Subsequent empty-bash retries surfaced opencode's clean error
"The argument 'file' cannot be empty. Received ''" rather than
SchemaError(Missing key).
- Model still drifts into empty-parameter bodies under MoE FP8
dequant precision drift (see `bench/fp8_dgx2_drift/
MASTER_DRIFT_TABLE.md`, ssm.moe_out @ L20 = 0.91983) — recovery
via the cleaner error message alone is insufficient. Sampler-
level enforcement (mask `</parameter>` close when param body has
emitted zero non-whitespace chars) is the next intervention.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Atlas's MoE FP8 grouped GEMM was applying the BF16 block-scale to each
FP8-dequant element BEFORE storing it in SMEM as BF16:
float scale = __bfloat162float(S_exp[n_block * k_blocks + k_block]);
smem_B[k][n] = __float2bfloat16(E4M3_LUT[byte] * scale);
The `__float2bfloat16(LUT * scale)` truncates each scaled weight to BF16's
7-bit mantissa BEFORE the BF16×BF16 MMA, so the scale's precision interacts
with EVERY single multiply-accumulate. Per the dgx2 per-op drift study at
`bench/fp8_dgx2_drift/MASTER_DRIFT_TABLE.md`, this manifested as the worst
single Atlas op cosine vs HF[BF16-unquant]: ssm.moe_out @ L20 = 0.91983
(8% deviation; unacceptable).
The fix is DeepGEMM's two-level accumulation:
1. SMEM stores `__float2bfloat16(E4M3_LUT[byte])` only — lossless because
FP8 E4M3 has 3-bit mantissa and BF16 has 7-bit mantissa.
2. Inner FP32 accumulator collects BF16×BF16 MMA products for one K=128
scale-block (8 K_STEP=16 iterations).
3. At the K-block boundary, multiply the inner accumulator by the
block-scale (one BF16→FP32 conversion per K-block per CTA) and add
to the outer FP32 accumulator.
4. Reset inner accumulator; advance to next K-block.
5. Final BF16 cast in the epilogue from outer_acc.
Net effect: the scale's BF16 truncation now happens ONCE per K=128 partial
sum, instead of being baked into every dequanted weight. The FP8 weight
precision is preserved through the inner MMA reduction.
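The difference between the two orderings can be shown with a small numeric sketch. It is an illustration under stated assumptions, not the CUDA kernel: Python floats stand in for the FP32 accumulators, `to_bf16` emulates a round-to-nearest-even BF16 store, and the weight, activation and scale values are chosen by hand so every intermediate is exact.

```python
import struct

K_BLOCK = 128  # one scale per 128 K-elements, as in the commit message


def to_bf16(x: float) -> float:
    """Round a value to BF16 precision (round to nearest even, no NaN handling)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def dot_scale_per_element(w, x, scales):
    """Old path: scale each weight, round to BF16, then multiply-accumulate."""
    acc = 0.0
    for i, (wi, xi) in enumerate(zip(w, x)):
        acc += to_bf16(wi * scales[i // K_BLOCK]) * xi
    return acc


def dot_two_level(w, x, scales):
    """New path: accumulate unscaled products per K-block, scale once per block."""
    outer = 0.0
    for b, scale in enumerate(scales):
        inner = 0.0
        for i in range(b * K_BLOCK, (b + 1) * K_BLOCK):
            inner += w[i] * x[i]
        outer += inner * scale
    return outer


w = [1.5] * K_BLOCK            # exactly representable in FP8 E4M3
x = [1.0] * K_BLOCK            # activations
scales = [1.0 + 2.0 ** -7]     # a BF16-representable block scale

exact = sum(wi * xi for wi, xi in zip(w, x)) * scales[0]
old_result = dot_scale_per_element(w, x, scales)
new_result = dot_two_level(w, x, scales)
```

In this hand-picked case every scaled weight lands on a rounding tie, so the per-element path accumulates the same rounding error 128 times while the two-level path matches the exact value.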
Per-CTA observation that simplifies the patch: N_TILE=64 < FP8_BLOCK=128
and `cta_n` is always a multiple of N_TILE, so all 64 N-cols of a CTA
fall within a single N-scale-block. One scale lookup per K-block per CTA,
applied uniformly to all 8 n_tile sub-tiles. No per-tile dispatch needed.
Applied to both kernel variants (`moe_fp8_grouped_gemm` v1 with the
original thread-to-smem mapping, and `moe_fp8_grouped_gemm_v2` with the
coalesced-load thread mapping).
**Measured cosine improvement (Atlas vs HF[BF16-unquant], 10382-token
canonical prompt, dgx1 single-node)**:
Metric BEFORE AFTER Δ
L20 ssm.moe_out (worst) 0.91983 → 0.93547 +0.0156
ssm.moe_out mean (n=30) 0.97456 → 0.98133 +0.0068
attn.moe_out mean (n=10) 0.97316 → 0.97806 +0.0049
ssm.out_proj mean (n=30) 0.97813 → 0.98144 +0.0033
layer.hidden_out mean 0.98982 → 0.99219 +0.0024
The 1.56pp lift on the worst single op shows the BF16-scale-truncation
was a real precision sink. The remaining ~0.065 gap to perfect cosine at
L20 (1 - 0.93547) will be addressed by stacking complementary patches:
- Native FP8 MMA on SM121 (per agent A6, confirmed available:
`mma.sync.m16n8k32.f32.e4m3.e4m3.f32` lowers to `QMMA.16832.F32.E4M3.E4M3`)
— eliminates the BF16 input-quantization step entirely AND yields
~2× tensor-core throughput. ~100-200 LoC kernel rewrite.
- FP32 residual stream (per agent A8, A1) — replaces `residual_add_rms_norm`
BF16 output with the existing FP32 variant. Reduces input-side drift
compounding across 40 layers.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
12-hour autonomous mission Epochs 1-3. The 11-agent research synthesis
identified three layered interventions to enforce non-empty tool-call
parameter values against the Qwen3.6-FP8 multi-turn drift class:
A (grammar minLength), B (sampler byte counter), C (parser holdback).
This commit ships A + B. The 4 grammar attempts converged on raw EBNF
as the only architecturally-correct primitive:
- `[\s\S]*\S[\s\S]*` (regex sandwich): ε-transitive `\S` in Kleene
closures, FSM skips required content (A3 finding).
- `[^ \t\r\n<][^<]*` (regex `+` with structural prefix): inner
enforcement works but outer `+` quantifier still ε-transitive — let
`</tool_call>` close with zero `<parameter=>` blocks (B6 finding).
- `json_schema` with `style: "qwen_xml"` and `minLength: 1` (B6):
json-schema converter has same ε-edge bug for `[^]{1,}` lowering.
- **EBNF with `value ::= first_char rest` (B5)**: rule INLINING (per
llama.cpp's GBNF compiler) means the parent rule is expanded to
require `first_char` as a single non-quantified terminal — no
ε-transition can skip it. This is what ships.
EBNF body for qwen3_coder XML wire format:
root ::= param ("\n" param)*
param ::= "<parameter=" paramname ">" value "</parameter>"
paramname ::= [a-zA-Z_] [a-zA-Z_0-9]*
value ::= first_char rest
first_char::= [^ \t\r\n<] (forces ≥1 non-WS non-< byte)
rest ::= [^<]*
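The language this EBNF accepts can be checked with an equivalent Python regular expression. This is only a membership test on complete strings; it says nothing about how xgrammar compiles or masks tokens, and `body_ok` is a hypothetical helper name.

```python
import re

# param ::= "<parameter=" paramname ">" first_char rest "</parameter>"
PARAM = r"<parameter=[a-zA-Z_][a-zA-Z_0-9]*>[^ \t\r\n<][^<]*</parameter>"
# root ::= param ("\n" param)*
ROOT = re.compile(rf"{PARAM}(?:\n{PARAM})*")


def body_ok(body: str) -> bool:
    """True if the whole body is one or more newline-separated params."""
    return ROOT.fullmatch(body) is not None


good = "<parameter=command>ls -la</parameter>"
two = good + "\n<parameter=timeout>30</parameter>"
empty_value = "<parameter=command></parameter>"
leading_space = "<parameter=command> ls</parameter>"
one_char = "<parameter=filePath>\\</parameter>"
```

Note that `one_char` is accepted: a single non-whitespace byte satisfies `first_char`, so the grammar alone cannot tell a real value from a one-character placeholder.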
Belt-and-suspenders Tier-1 sampler-side byte counter (decode_logits_seq.rs):
new ActiveSeq fields `inside_parameter_body: bool` and
`param_body_chars_emitted: u32` track per-sequence state. When inside
a parameter body AND no content tokens have been emitted yet,
logit_bias appends `(510, -8.0)` (`</` close-tag opener) PLUS
`(220, 198, 197, 256, 271, -8.0)` (common Qwen3 whitespace tokens).
emit_step.rs flag flip detects opener via last-8-token signature
`[27, 15704, 28]` ending in `>` (29); decrements on `</` (510);
does NOT count whitespace tokens toward the chars counter (so the
mask survives a sampled whitespace token).
Critical Tier-1 gate: forced_token_fastpath at decode_logits_seq.rs:307
was bypassing the logit_bias by returning the grammar's sole legal
token directly. When `tier1_active` (inside_parameter_body && chars==0)
the fastpath is skipped so the sampler pipeline actually applies bias.
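A sketch of the Tier-1 gate and bias list, assuming the token ids and the -8.0 value quoted above. The function names and the dictionary-based logits are hypothetical simplifications of what `decode_logits_seq.rs` does on real logit tensors.

```python
CLOSE_TAG_OPENER = 510                     # `</` per the commit message
WHITESPACE_TOKENS = (220, 198, 197, 256, 271)
TIER1_BIAS = -8.0


def tier1_active(inside_parameter_body: bool, param_body_chars_emitted: int) -> bool:
    """The mask applies only before the first content character of a value."""
    return inside_parameter_body and param_body_chars_emitted == 0


def tier1_logit_bias(inside_parameter_body: bool, param_body_chars_emitted: int):
    """Return the (token_id, delta) pairs to add, or an empty list."""
    if not tier1_active(inside_parameter_body, param_body_chars_emitted):
        return []
    return [(tok, TIER1_BIAS) for tok in (CLOSE_TAG_OPENER, *WHITESPACE_TOKENS)]


def argmax_after_bias(logits: dict, bias) -> int:
    """Apply additive biases to a {token_id: logit} map and pick the argmax."""
    biased = dict(logits)
    for tok, delta in bias:
        if tok in biased:
            biased[tok] += delta
    return max(biased, key=biased.get)


# Toy logits: the close tag is the raw favourite, token 1234 is real content.
toy_logits = {510: 5.0, 1234: 1.0}
at_start = argmax_after_bias(toy_logits, tier1_logit_bias(True, 0))
after_content = argmax_after_bias(toy_logits, tier1_logit_bias(True, 3))
```

The forced-token fastpath matters because it returns a token without ever building this bias list, which is why the commit skips it while the gate is active.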
Also enabled `wants_typed_arguments=true` on Qwen3CoderParser so
PR-73's `coerce_all` runs schema-aware type coercion (string → number)
on bash's `timeout` field — addresses opencode's
`SchemaError(Expected number, got "30")` failure mode independent of
the param-body work.
**v57 measured result** (Epoch 3 EBNF + Tier-1 sampler):
- 1 file persisted (Cargo.toml) — best non-precision-stack result
- axum 0.8 dependency cleanly added with json feature
- Atlas log shows model emitted real axum code:
`use axum::{routing::get, Json, Router}; async fn pong() -> Json`
- Remaining gap: model sometimes emits 1-char garbage values
(single backslash) to satisfy the 1-char minimum; need
schema-aware min-length OR Tier-2 closer-suffix holdback.
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
12-hour autonomous mission Epoch 4 (final): tightened
validate_single_tool_call so WRITE_FAMILY paths must start with
'/', './', or '../' AND be ≥3 chars, and SHELL_FAMILY commands must
be ≥2 chars after trim. Addresses the model's "satisfy minimum with
single garbage char" failure mode that Epoch 3's EBNF rule-inlining
enforcement exposed (model emitted single backslash `\` for filePath
to satisfy the EBNF ≥1 non-WS non-`<` requirement).
Result: v58 test still scored 0 files because the model now drifts
to entirely-wrong paths like `/test-tu-au-u8` that PASS the loose
"starts with /" validator but bear no resemblance to the requested
target. Confirms: no Atlas-side enforcement can teach the model
TASK SEMANTICS — only structural validity. Multi-turn task
faithfulness gap remains the open problem.
Mission summary written to bench/fp8_dgx2_drift/MISSION_PROGRESS.md.
Best result of the mission: Epoch 3 (raw EBNF) v57 — 1 file
persisted (Cargo.toml with valid axum 0.8 dependency), Atlas logs
captured real axum handler code emitted by the model (`use axum::{
routing::get, Json, Router}; async fn pong() -> Json`). EBNF
rule-inlining is the only architecturally-correct grammar primitive
for non-empty enforcement on Atlas's vendored xgrammar (all 3 prior
regex/json_schema attempts blocked by upstream ε-edge bugs).
Recommended next-mission focus: multi-turn task faithfulness
(FlowKV per-turn KV isolation arXiv:2505.15347, preserve_thinking
chat-template fix per A4 community research, native FP8 MMA to
close remaining 0.04 cosine gap on MoE block).
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Checkpoint of the Qwen3.6-35B-A3B-FP8 vs vLLM cosine-parity investigation.

Key fixes:
- paged_oproj.rs: W8A8 o_proj called fp8_gemm_t_blockscaled with N and K swapped (passed nq*hd as N, h as K). Fix lifted Atlas-vLLM mean residual cosine 0.800 -> 0.997 on short prompts.
- quant_helpers.rs: dequant_fp8_blockscaled_to_bf16 now launches a GPU kernel (dequant_fp8_blockscaled_bf16.cu) instead of a D2H + CPU loop + H2D round-trip. FP8 MoE BF16-dequant load drops ~22min -> ~49s (~130x on the dequant step), bit-identical numerics.

Diagnostics (env-gated, inert by default):
- op_dump hooks for attn_out_pre_gate/post_gate, k_post_norm, k_post_rope, q_post_rope in qwen3_attention prefill paths.
- BF16 MoE grouped GEMM + dispatch (ATLAS_FP8_DEQUANT_MOE_TO_BF16), plus bench/fp8_dgx2_drift cosine + op-cosine harness.

Findings: MoE GEMM precision is NOT the drift source (BF16 MoE == FP8 MoE at L39 cos 0.952 vs 0.955; NVFP4 MoE worse at 0.868). Remaining 0.955 floor traced to FP8 attention Q/K/V/O projections.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
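As a reference for what a block-scaled FP8 dequant computes, here is a small CPU sketch. It assumes the OCP E4M3 (finite-only) encoding and a `[N/BS, K/BS]` scale layout as described earlier in this PR; it is not the `dequant_fp8_blockscaled_bf16.cu` kernel, and it skips the final BF16 rounding.

```python
def e4m3_to_float(byte: int) -> float:
    """Decode one FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")                     # the only NaN encoding
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6   # subnormal
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def dequant_block_scaled(q, scales, bs: int = 128):
    """w[n][k] = decode(q[n][k]) * scales[n // bs][k // bs]."""
    return [
        [e4m3_to_float(byte) * scales[n // bs][k // bs] for k, byte in enumerate(row)]
        for n, row in enumerate(q)
    ]


# Tiny example with bs=2: a 2x4 weight and a 1x2 scale grid.
q = [[0x38, 0x38, 0x40, 0x40],
     [0x38, 0x38, 0x40, 0x40]]
scales = [[0.5, 2.0]]
w = dequant_block_scaled(q, scales, bs=2)
```

Every output element is independent of the others, so this loop maps naturally onto one GPU thread per element, which is consistent with the large load-time win from moving it off the CPU.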
…iagnostics

Root-caused the opencode agentic-harness "wandering" (shell-fragment dir names, collapsed-newline TOML) via a per-decode WS-mask diagnostic. It was NOT numerical drift but two server-side logit/validation bugs vLLM lacks:

- validation.rs: the write-path validator required filePath to start with '/', './' or '../', rejecting bare relative paths like `Cargo.toml` that opencode resolves against --dir (and that vLLM accepts). opencode looped on the rejection and abandoned the task. Now accepts any non-empty path without shell metacharacters/whitespace (still rejects leaked commands like `created && ls -R`, CWE-78).
- whitespace_mask.rs: the WS1/WS2 mask suppressed the newline token (198) inside tool-param bodies. ATLAS_WS_MASK_DIAG confirmed the model's pre-mask top-1 was `\n`, demoted to `;`/` &&`/` ||`, which is the exact mechanism producing shell-fragment paths AND collapsed-newline TOML. Now masks only HORIZONTAL whitespace; line breaks are never suppressed. Added ATLAS_WS_MASK_OFF=1 to disable the whole mask (matches vLLM, which applies no whitespace bias at all).

Diagnostics / kill-switches (env-gated, inert by default):
- ATLAS_WS_MASK_DIAG=1: logs param-body whitespace argmax flips.
- ATLAS_DISABLE_FUZZY_REPAIR=1: no-op the SymSpell tool-arg repair.
- ATLAS_FP8_DEQUANT_ATTN_TO_BF16=1: dequant attention Q/K/V/O to BF16 (diagnostic, TP=1; confirmed weight precision is not the drift source).
- run_tier.sh --bail: exit on first non-clean run.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
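The relaxed write-path rule can be sketched as below. The commit message states the rule (non-empty, no shell metacharacters, no whitespace) but not the exact character set, so `SHELL_METACHARS` is an assumption and `write_path_ok` is a hypothetical name, not the function in validation.rs.

```python
# Assumed metacharacter set; the real list in validation.rs may differ.
SHELL_METACHARS = set("&|;<>$`\\\"'(){}*?!")


def write_path_ok(path: str) -> bool:
    """Accept any non-empty path with no whitespace or shell metacharacters."""
    if not path:
        return False
    if any(ch.isspace() for ch in path):
        return False
    return not any(ch in SHELL_METACHARS for ch in path)


for candidate in ["Cargo.toml", "./src/main.rs", "created && ls -R", ""]:
    print(repr(candidate), write_path_ok(candidate))
```

Bare relative paths pass because nothing here inspects the prefix; leaked commands fail on the whitespace and `&` checks.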
Generalizes the WS_MASK_DIAG probe into a full per-decode-step JSONL dump for Atlas↔vLLM divergence analysis. Each step records:
- the raw model top-K (id,logit) BEFORE Atlas's additive logit-bias stack,
- the itemized list of every (id,delta) bias applied (WS mask, attractor, A4 think-suppress, C4 lift, none of which vLLM has),
- the post-bias argmax,
- the sampled token, sliced by tool-param-body context.

Diffing raw_topk against vLLM's localizes MODEL divergence; the bias list itemizes Atlas-only processor divergence. Env-gated, inert by default; bias is cloned only when active.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
On an exact full-prompt prefix-cache hit (matched == total), the engine re-embeds and re-runs the last prompt token through all layers to produce the first generated token's logits. For SSM/Mamba layers that re-run applies the last token's recurrent update a SECOND time on top of the restored snapshot state@N, double-advancing the recurrent state and corrupting both the first-token logits and the state decode reads.

Proven by byte-compare: fresh-prefill SSM state == warm-restored SSM state exactly (all layers). Save/restore was never the bug; the re-run was. Intermediate hits (matched < total) recompute the tail from a checkpoint and process each token once, so they were already correct.

Fix: the leaf snapshot now also stashes the last token's post-final-norm hidden (hidden_size BF16). On an exact hit, finalize_last re-restores the pristine SSM state@N (undoing the redundant re-run's advance) and feeds the stashed hidden straight to lm_head, skipping any SSM re-run. The redundant 1-token forward is otherwise harmless (its KV write duplicates cached values). Cold/warm output is now byte-identical and warm hits stay fast.

Note: prefill_c.rs prefill_full_cache_hit (two-phase / EP=2) shares the same pattern and is a known follow-up; not on the single-GPU prefill_b path.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
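The double-advance is easiest to see on a toy scalar recurrence. Everything in this sketch is hypothetical (the decay constant, the readout, the `LeafSnapshot` fields); it only mirrors the shape of the bug and of the fix described above.

```python
from dataclasses import dataclass

DECAY = 0.5  # toy recurrence coefficient, chosen for exact arithmetic


def ssm_step(state: float, token: float) -> float:
    """One recurrent update: the state absorbs one token."""
    return DECAY * state + token


def readout(state: float) -> float:
    """Toy stand-in for the post-final-norm hidden fed to lm_head."""
    return 2.0 * state


@dataclass
class LeafSnapshot:
    state: float        # recurrent state after the full prompt (state@N)
    last_hidden: float  # stashed hidden for the last prompt token


def prefill(tokens) -> LeafSnapshot:
    state = 0.0
    for tok in tokens:
        state = ssm_step(state, tok)
    return LeafSnapshot(state, readout(state))


def exact_hit_buggy(snap: LeafSnapshot, last_token: float):
    """Re-run the last token on top of state@N: the update is applied twice."""
    state = ssm_step(snap.state, last_token)
    return state, readout(state)


def exact_hit_fixed(snap: LeafSnapshot):
    """Keep state@N and reuse the stashed hidden; no recurrent re-run."""
    return snap.state, snap.last_hidden


prompt = [1.0, 2.0, 4.0]
cold = prefill(prompt)
buggy_state, buggy_hidden = exact_hit_buggy(cold, prompt[-1])
fixed_state, fixed_hidden = exact_hit_fixed(cold)
```

Attention layers do not show this problem because re-running a token only rewrites the same KV entry, whereas a recurrent state has no idempotent write.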
…ool-call drift)

Two opencode-quality fixes for Qwen3.6-35B-A3B-FP8, plus diagnostics.

MTP 0% draft acceptance: ROOT CAUSE + FIX

With ATLAS_FP32_RESIDUAL the main model's hidden stream is FP32, but the MTP head's step-2 input rms_norm read the saved hidden as BF16 → NaN → all-NaN logits → argmax==0 → every draft was token 0 → verify always rejected → 0% acceptance (no speculative speedup). The token embedding (always BF16) normed fine; only the FP32 hidden was misread, which is why prior weight-quant / vocab / KV-dtype attempts all failed (downstream of the NaN).

Fix: the MTP hidden-norm now uses the FP32-input rms_norm (rms_norm_f32, BF16 out) when use_fp32_residual, so fp32-residual and MTP coexist. With the fix MTP accepts ~1.67 drafts/step (~83%) and decode rises from ~34 to ~54 tok/s. Also: mtp_head now honors BF16 KV (gated by mtp_quantization) instead of hard-coding FP8 with unit scales, matching the main model.

BF16 router (ATLAS_BF16_ROUTER): tool-call drift

The MoE router/gate was NVFP4 (4-bit); FP8 routing flips drove malformed tool calls (empty filePath, mixed <function_calls> tags). Keeping ONLY the router in BF16 (a tiny num_experts×h GEMM; experts stay FP8 = full speed) eliminates the empty_path drift at no throughput cost.

Diagnostics (env-gated, inert by default): ATLAS_TOOLCALL_DEBUG (post-parse arg shape on empty-path), ATLAS_MTP_DEBUG_NORMS (staged MTP ‖·‖ to localize NaN/zero), ATLAS_LMHEAD_BF16 / ATLAS_OFFICIAL_TOOL_PROMPT (A/B levers).

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
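Why an FP32 buffer read as BF16 produces garbage can be demonstrated at the byte level. This is a generic illustration assuming little-endian storage; it does not reproduce the MTP head's actual buffers, and the helper names are hypothetical.

```python
import math
import struct


def bf16_from_u16(half: int) -> float:
    """Interpret a 16-bit pattern as BF16 (the top half of a float32)."""
    return struct.unpack("<f", struct.pack("<I", half << 16))[0]


def misread_f32_as_bf16(values):
    """Store values as float32, then read the same bytes as BF16 elements."""
    raw = struct.pack(f"<{len(values)}f", *values)
    halves = struct.unpack(f"<{len(raw) // 2}H", raw)
    return [bf16_from_u16(h) for h in halves]


# Each float32 turns into two BF16 values: its low mantissa bits, then its top half.
simple = misread_f32_as_bf16([1.0, 2.0])

# A float32 close to 1.0 whose low 16 bits happen to spell a BF16 NaN (0x7FC0).
tricky = struct.unpack("<f", struct.pack("<I", 0x3F807FC0))[0]
poisoned = misread_f32_as_bf16([tricky])
```

Half of the misread elements are low-order mantissa bits, which can decode to anything including NaN, and a single NaN entering an rms_norm reduction is enough to poison the whole row.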
…RECOVERY)

FP8 low-margin token flips corrupt the write tool-call filePath (empty, truncated dir, or hallucinated like /tmp/pure_axioms.txt) while the CONTENT is correct. As a result the Axum server never lands in src/main.rs, cargo builds the default scaffold (false-positive cargo_valid) and port 3001 never binds.

recover_drifted_write_path() salvages the model's correct content to the content-inferred in-project path (Cargo.toml / src/main.rs) when the path is unusable. Env-gated (PCND opt-in); recovers intent, never invents content.

Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Removes ATLAS_FP32_RESIDUAL + use_fp32_residual() and collapses all ~54 call sites to the BF16 branch (residual/hidden buffers, rms_norm / residual-add kernel selection, per-token byte strides). BF16 residual is the shipping config; the FP32 path was incompatible with MTP (it fed the MTP head FP32 hidden it read as BF16 → NaN → 0% draft acceptance). With FP32-residual gone the model is always BF16-residual, so MTP works by default. Orphaned FP32 .cu kernels left in place (unreferenced); the ATLAS_GEMMA4_FP32_LMHEAD scaffolding is now inert (it required FP32 residual) and can be excised in a follow-up. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Three additive, content-recovering fixes for FP8 tool-call drift on the opencode rust-axum harness (Qwen3.6-35B-A3B-FP8), all gated behind the existing ATLAS_WRITE_PATH_RECOVERY flag (PCND opt-in, default-off): 1. Narrate-then-tool fence salvage (tool_salvage): when the model emits no native tool_call but renders a whole file in a bare ```rust/```toml fence, extract_fenced now infers the write target from the body's content shape (fn main(->src/main.rs, [package]->Cargo.toml) and synthesises the write(). Plumbs an infer_paths flag from salvage(). 2. Write-content markdown fence-unwrap (validation): when the model dumps its entire markdown answer into a write() content field (code + stray ``` fence + trailing prose / "Principles Applied"), strip the wrapper and trailing prose for code files so the .rs/.toml compiles. 3. Path quoted-comma sanitizer (validation): strip a trailing comma/ws so a JSON-fragment-shaped filePath like `"/tmp/x/Cargo.toml",` is cleaned instead of creating a garbage-named file. SSOT: extracts classify_path_from_content shared by recover_drifted_write_path and the fence salvage. Never invents content — only recovers the body/path the model produced. +8 unit tests (fence contamination, path-comma, salvage inference); no production behaviour change when the flag is off. Together these produced Atlas's first end-to-end webserver_ok=true on the opencode harness. Residual webserver_ok gap is FP8 decode content-completeness (not serialization) — tracked separately under the BF16-MoE quality work. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
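A minimal sketch of the content-shape inference and the path sanitizer described above. The function bodies are assumptions: only the mappings (`fn main(` to `src/main.rs`, `[package]` to `Cargo.toml`) and the trailing comma/whitespace strip come from the commit message. Stripping the wrapping quotes and the name `sanitize_path` are illustrative additions.

```rust
/// Infer an in-project write target from the shape of the file body.
/// Returns None when the body matches neither known shape, so nothing is invented.
fn classify_path_from_content(body: &str) -> Option<&'static str> {
    if body.contains("[package]") {
        Some("Cargo.toml")
    } else if body.contains("fn main(") {
        Some("src/main.rs")
    } else {
        None
    }
}

/// Clean a JSON-fragment-shaped path: drop surrounding whitespace and a
/// trailing comma, then (assumption) one pair of wrapping double quotes.
fn sanitize_path(raw: &str) -> &str {
    let trimmed = raw.trim().trim_end_matches(',').trim_end();
    trimmed
        .strip_prefix('"')
        .and_then(|s| s.strip_suffix('"'))
        .unwrap_or(trimmed)
}
```

Returning `None` for unrecognised bodies keeps the helper in line with the stated rule that recovery never invents content or targets.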
… BF16 forward_k2/forward_k3 (the fused multi-token MoE forwards used for MTP verify and 2/3-sequence batched decode) only had FP8 and NVFP4 branches. When experts are dequanted FP8->BF16 at load (ATLAS_FP8_DEQUANT_MOE_TO_BF16), the FP8 source weights are freed, so these forwards read freed memory -> garbage verify logits -> degenerate repetition under MTP. Fix: early-return delegate to forward_batched(input, K, ...) when bf16_gate_weight_ptrs.is_some(). forward_batched already has the tested per-token BF16 expert path and writes the same moe_output()[K,H]. SSOT: reuses the decode BF16 kernels, no new fused batch2/batch3 kernel. Verified: MoE-dequant + MTP is now coherent (was repetition), stable across a full N=10 (no slot leak, no crash). Inert for the FP8/NVFP4 paths (bf16_gate_weight_ptrs is None without the dequant flag). Note: the harness-level BF16-MoE quality lift does not beat the FP8 baseline (decode-speed cost + partial quality); see memory project_bf16_dequant_lever. This commit lands the correctness fix regardless. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Diagnostic probe tests (eprintln, no assertions) exploring TOML auto-repair behaviour on collapsed-one-line manifests with XML content leaks (`</content>`) and missing-`=` shapes observed in earlier harness runs. Scratch from a prior debugging session, preserved on request. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ostics Two env-gated, default-off diagnostics for pinpointing async CUDA faults: - ATLAS_DEBUG_SYNC_KERNELS (kernel_args.rs): synchronize the stream after every KernelLaunch::launch so an illegal-address fault surfaces AT the culprit kernel (with grid/block + a force-captured backtrace) instead of at a later, unrelated sync. Env read once via OnceLock — zero prod cost. - ATLAS_DEBUG_NO_GRAPH (impl_a1.rs): force eager batched decode (suppress CUDA-graph capture) so the per-launch sync above is permitted (you cannot cuStreamSynchronize a capturing stream → 900). Together these pinpointed the BF16 o_proj OOB (next commit). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…_BF16
ms_phase_o_proj (multi-seq batched-decode O projection) had FP8 and NVFP4
branches but NO BF16 branch. Under ATLAS_FP8_DEQUANT_ATTN_TO_BF16 the o_proj
weight is dequanted to BF16 (o_dense_bf16), so o_weight.as_fp8() returns None
and the dispatch fell through to the NVFP4 w4a16_gemv_batch{2,3} path using the
stale FP8/NVFP4 self.attn.o_proj — reading mismatched weight bytes →
CUDA_ERROR_ILLEGAL_ADDRESS (700) in batched decode (n>=2 / MTP).
Fix: add the o_dense_bf16 branch first (per-token dense_gemv), mirroring the
single-seq decode path (attention_forward_oproj.rs). Inert for FP8/NVFP4
(o_dense_bf16 is None without the dequant flag).
Verified: full bit-perfect config (MoE+attn BF16 + MTP) now runs coherent
batched decode with zero CUDA-700 (was: crash). Enabling fix for the BF16
webserver_ok lever (#15).
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Layers a freshly-built spark binary onto the prior alpha image (FROM a local atlas-gb10 tag) to skip the multi-minute CUDA build during iteration — same pattern as Dockerfile.fence/.fast. Used to build the BF16 dequant / debug-sync diagnostic images this session. Base tag is local-only; not reproducible on a fresh clone (matches existing fast-layer Dockerfiles). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New per-layer gate for the FP8→BF16 dequant path: ATLAS_FP8_DEQUANT_LAYERS takes a comma-separated list of singletons/inclusive ranges (e.g. "31-39" or "31,35,39") of absolute layer indices. Unset → all layers (legacy behaviour, no change). Gates both ATLAS_FP8_DEQUANT_MOE_TO_BF16 and _ATTN_ via a single layer_dequant_selected() helper (parsed once via OnceLock). Motivation: full-BF16 dequant lands the best per-token quality but decode is ~30% slower (2× MoE weight bandwidth → 360s harness timeouts; task #231). Restricting BF16 to the worst-drift late layers (L31-39) keeps the bulk FP8- fast: measured 44.7 tok/s (vs full-BF16 33.1, FP8 47.7) — ~94% of FP8 speed — while reaching cargo_valid 10/10 on the opencode harness (best of any tier; FP8=8/10, full-BF16=9/10, =vLLM 10/10). webserver_ok stays ~0-1/10 across all quant configs: confirmed NOT a quant- quality or decode-speed problem — it's the agentic content-completeness ceiling (model reliably emitting a complete binding server), per the gap forensic. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
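The layer-list format above (singletons and inclusive ranges, unset meaning all layers) could be parsed roughly as follows. This is a sketch under assumptions: `parse_layer_spec` and `layer_selected` are hypothetical names, and treating a malformed value as "unset" is a choice made here, not something the commit states.

```rust
use std::collections::BTreeSet;
use std::sync::OnceLock;

/// Parse "31-39" or "31,35,39" into a set of absolute layer indices.
/// Returns None on any malformed item or a reversed range.
fn parse_layer_spec(spec: &str) -> Option<BTreeSet<usize>> {
    let mut layers = BTreeSet::new();
    for part in spec.split(',').map(str::trim).filter(|p| !p.is_empty()) {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let lo: usize = lo.trim().parse().ok()?;
                let hi: usize = hi.trim().parse().ok()?;
                if lo > hi {
                    return None;
                }
                layers.extend(lo..=hi);
            }
            None => {
                layers.insert(part.parse().ok()?);
            }
        }
    }
    Some(layers)
}

/// No selection means every layer is selected (legacy behaviour).
fn layer_selected(selection: Option<&BTreeSet<usize>>, layer: usize) -> bool {
    selection.map_or(true, |set| set.contains(&layer))
}

/// Read the env var once and cache the parsed result for later calls.
fn layer_dequant_selected(layer: usize) -> bool {
    static SELECTION: OnceLock<Option<BTreeSet<usize>>> = OnceLock::new();
    let selection = SELECTION.get_or_init(|| {
        std::env::var("ATLAS_FP8_DEQUANT_LAYERS")
            .ok()
            .and_then(|s| parse_layer_spec(&s))
    });
    layer_selected(selection.as_ref(), layer)
}
```

Keeping the parser separate from the cached env read makes the parsing logic testable without touching process environment state.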
FP8 agentic failure mode (forensic gap #9): the model explores (bash ls/cat/ find, read, glob) or narrates across many turns but never writes the deliverable, so the run ends with a valid Cargo.toml but no real src/main.rs — webserver_ok never fires. BW1 tallies tool-call productivity across the conversation (write/edit tools + bash write/build/run verbs = productive; ls/cat/find/grep/read/glob = exploration) and, when the agent has made many calls with zero productive output, appends a steering nudge to the most recent tool response redirecting it to write the file(s) and verify. Escalates to a CRITICAL nudge past ~9 calls. Lives in hint_injector (alongside the error-recovery hints); injected at msg_entry after the conversation scan. Env-gated ATLAS_BASH_WANDER_WATCHDOG=1 (PCND, default-off). +2 unit tests (classifier + threshold/escalation). Targets the only remaining webserver_ok lever after the quant/speed track (cargo_valid 10/10 via selective L31-39); forensic est +0..+2, partly model-inherent. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
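The productivity tally could look something like the sketch below. The tool split (write/edit productive, ls/cat/find/grep/read/glob exploratory) and the escalation past about 9 calls come from the commit message. The bash verb list, the `STEER_AFTER` threshold of 5, and all identifiers are assumptions made for illustration.

```rust
#[derive(Debug, PartialEq)]
enum Nudge {
    Quiet,
    Steer,
    Critical,
}

/// Hypothetical thresholds: the commit only says "many calls" and "past ~9 calls".
const STEER_AFTER: usize = 5;
const CRITICAL_AFTER: usize = 9;

/// A call is productive if it writes, edits, builds, or runs something.
fn is_productive(tool: &str, args: &str) -> bool {
    match tool {
        "write" | "edit" => true,
        "bash" => {
            // Assumed verb list; the real classifier may differ.
            let verb = args.split_whitespace().next().unwrap_or("");
            matches!(verb, "cargo" | "mkdir" | "touch" | "tee")
        }
        _ => false, // read, glob, grep, and unknown tools count as exploration
    }
}

/// Decide whether to append a steering nudge to the latest tool response.
fn nudge_for(calls: &[(&str, &str)]) -> Nudge {
    if calls.iter().any(|(tool, args)| is_productive(tool, args)) {
        return Nudge::Quiet;
    }
    match calls.len() {
        n if n > CRITICAL_AFTER => Nudge::Critical,
        n if n >= STEER_AFTER => Nudge::Steer,
        _ => Nudge::Quiet,
    }
}
```

In this sketch a single productive call anywhere in the conversation silences the watchdog, which matches the "zero productive output" condition in the description.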
…sweeps) N=10 run JSONs + aggregate reports for the ws_ok lever investigation: fencecontent/pathrec baselines, bf16moe/bf16both/bf16sel selective-BF16 levers, bw1_sel anti-wander watchdog (confirmed no-op), vLLM 13/13 reference. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ments in Cargo.toml extract.rs: a bare ``` at end-of-content with no trailing newline made body_start = content.len()+1, panicking the slice at extract.rs:121 and crashing the salvage worker (observed live in disable-thinking opencode runs) — so that request's recovered write never landed. Guard body_start > len. toml_repair.rs: the FP8 model sometimes annotates Cargo.toml with C-style `//` comments (TOML uses #), e.g. `name = "x" // not real TOML`, making the file unparseable. Add a conservative char-safe stripper (only outside double-quoted strings, so URLs/# are untouched) as a repair candidate. Both unit-tested (toml_repair 11/11; extract EOF-fence regression). Found during the 12h /loop while investigating the Atlas-vs-vLLM webserver_ok gap. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
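Both fixes can be sketched in a few lines. This is an illustration under assumptions, not the code in `extract.rs` or `toml_repair.rs`: `fence_body` and `strip_c_style_comment` are hypothetical names, and the stripper handles only double-quoted strings, as the commit describes.

```rust
/// Return the text after the fence line that starts at byte `fence_at`.
/// A bare fence at end-of-content (no trailing newline) yields None
/// instead of slicing one byte past the end.
fn fence_body(content: &str, fence_at: usize) -> Option<&str> {
    let after_fence = content.get(fence_at..)?;
    let line_end = after_fence
        .find('\n')
        .map_or(content.len(), |n| fence_at + n);
    let body_start = line_end + 1;
    if body_start > content.len() {
        return None; // the guard: body_start would be len + 1
    }
    Some(&content[body_start..])
}

/// Strip a trailing `//` comment from one line, ignoring any `//`
/// that appears inside a double-quoted string (for example a URL).
fn strip_c_style_comment(line: &str) -> &str {
    let mut in_string = false;
    let mut escaped = false;
    let mut prev_slash_at: Option<usize> = None;
    for (i, c) in line.char_indices() {
        if in_string {
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            prev_slash_at = None;
            continue;
        }
        match c {
            '"' => {
                in_string = true;
                prev_slash_at = None;
            }
            '/' => {
                if let Some(start) = prev_slash_at {
                    return line[..start].trim_end();
                }
                prev_slash_at = Some(i);
            }
            _ => prev_slash_at = None,
        }
    }
    line
}
```

Iterating with `char_indices` keeps every slice on a character boundary, which is one way to read the "char-safe" requirement.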
…lling the agentic verify loop The spinning detector counted any assistant turn with content<500 AND tool_args<100 as 'short'; 5 consecutive → suppress_tool_call (-12 bias on <tool_call> start). But in an agentic coding loop the verify cycle (bash cargo build / cargo run / curl, read, small edit) is a run of legitimately short-arg tool calls — so it tripped after ~5 turns and hard-masked the next tool call, killing the build→error→fix→rebuild loop. Measured: Atlas capped at ~4-5 turns / 2 bash vs vLLM's 12-17 turns / 8 bash on the same task (vLLM has no such suppressor → 10/10 webserver_ok). Fix: a turn that issued ANY non-empty tool call is taking an action (progress), so it is substantial — spinning now only fires on consecutive short PURE-TEXT turns. Genuine repeated-tool-call loops remain caught by loop_detector::detect (the Suppress verdict). Live result: turns 4→9-15, bash 2→5-7, 'Spinning detection fired' → 0, builds now succeed, and Atlas landed its FIRST webserver_ok=true on the opencode harness (real axum server, /ping→pong on 0.0.0.0:3001). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
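The revised rule can be sketched as below. The 500-character content threshold and the run length of 5 come from the commit message. The `Turn` struct and function names are hypothetical, and lengths are assumed to be measured in characters.

```rust
/// One assistant turn, reduced to the two lengths the detector looks at.
struct Turn {
    content_len: usize,
    tool_args_len: usize,
}

/// After the fix: a turn with any non-empty tool call is progress,
/// so only short pure-text turns count toward spinning.
fn is_short_turn(turn: &Turn) -> bool {
    turn.tool_args_len == 0 && turn.content_len < 500
}

/// Spinning fires when the most recent 5 turns are all short pure text.
fn spinning(turns: &[Turn]) -> bool {
    let run = turns.iter().rev().take_while(|t| is_short_turn(t)).count();
    run >= 5
}
```

Under the old rule (content under 500 and tool args under 100) a run of `cargo build` / `curl` calls would have counted as short; here the same run resets the counter.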
…under-covers a full-prompt match #226 warm-hit SSM corruption: when an intermediate checkpoint matched at full prompt length (ssm_snapshot_tokens < matched == total — e.g. the leaf snapshot was LRU-evicted from the 16-slot pool, leaving only a block-aligned checkpoint), prefix_lookup.rs forced skip_tokens=matched, restoring SSM state to snap_tok but advancing positions/KV to total → stale recurrent h_state/conv_state by (total-snap_tok) tokens → first decoded token reads misaligned state → healthy logits but wrong → immediate stop / empty completion. Clusters in later N=10 runs (warm cross-request cache). Fix: skip only to snap_tok so suffix-prefill replays SSM over the gap (mirrors the matched<total intermediate path); exact-leaf hit (snap_tok==matched==total) still routes through the 25f8bbe fixup. Same guard added to the EP=2 two-phase path (prefill_c.rs). Live N=10 (cache ON): give-ups 4→1, ws_ok 2→3/10. PARTIAL — a residual warm-hit path remains (cache-OFF still best at 6/10). Correct direction (fix corruption vs disable caching); residual is a known follow-up. run_tier.sh: OC_TIMEOUT env knob (default 360) for de-confounding runs. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
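The core of the fix is the choice of skip point. A minimal sketch, with hypothetical function names and made-up token counts in the usage note: the prefill may skip only as far as the restored SSM snapshot covers, and the remainder is replayed.

```rust
/// How many prompt tokens prefill may skip on a prefix-cache hit.
/// `snap_tok` is the token count covered by the restored SSM snapshot,
/// `matched` is the length of the matched prefix.
fn skip_tokens(snap_tok: usize, matched: usize) -> usize {
    // Never skip past the snapshot: recurrent state beyond it is stale.
    snap_tok.min(matched)
}

/// Tokens that suffix-prefill must replay so SSM state catches up.
fn replay_len(snap_tok: usize, matched: usize, total: usize) -> usize {
    total - skip_tokens(snap_tok, matched).min(total)
}
```

With an illustrative checkpoint at 4096 tokens under a full match of 5000, the old behaviour would skip 5000 and replay nothing, while this sketch skips 4096 and replays 904. An exact-leaf hit (`snap_tok == matched == total`) replays nothing.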
…pro tooling + bench evidence prefix_lookup.rs: env-gated (default OFF) bypass of the exact-leaf snapshot shortcut — routes exact full-prompt hits through full recompute. Used to rule out the marconi_exact_snap fixup as the cache-ON regression cause (bypass still 2/10). Inert in production. Adds prefix_partial_hit_repro.py + N=10 run JSONs from the prefix-cache investigation (spinfix/pfxfix/pfxfix2/exactprobe/cacheoff2/bf16* tiers). Findings (memory project_gaps_to_10of10): single-forward partial-hit is bit-exact (verified, identical-prompt cold-vs-warm); harness cache-ON regression (23% vs cache-OFF 65%, z=3.2) is real but lives in the multi-turn/eviction/batched regime — next: instrumented multi-turn repro. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
<tool_call> leak (Qwen3.6 + opencode)…og hardening
First clean Atlas-FP8 webserver_ok (atlas_capfix run2: build+bind+ping=pong, 9 turns,
309s). Snapshot of the fix/in-think-tool-call-leak working tree at the capfix milestone
(the exact state atlas-gb10:capfix was built from).
- F1: unconditional per-generation post-think content-token cap
(max_post_think_content_tokens, default 100_000 no-op; qwen3.6 MODEL.toml=1536),
plumbed atlas-kernels -> WatchdogParams; caps in emit_step.rs + decode_logits_content.rs
- F4: sticky tool_request bool so the inter-tool prose-budget survives grammar disengage
- F5: max_inter_tool_prose 2048 -> 1024 (qwen3.6 MODEL.toml only)
- F2: bounded value-rest rest_part{0,6000} behind ATLAS_GRAMMAR_VALUE_HARDEN (default OFF)
- harness: aggregate.py/run_tier.sh exit code = total cargo+webserver failure count
Revert target (pre-session committed base): 7e8e2d6
Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… serialization parity ROOT CAUSE of the opencode webserver_ok gap (Atlas-FP8 ~0/10 vs vLLM-FP8 10/10 on IDENTICAL FP8 weights): repetition_penalty=1.1 in MODEL.toml [sampling.*]. At the harness temp=0, the frequency penalty (256-token window) suppresses the most-recurrent code tokens (newline id198, `::` id476) and flips low-margin greedy picks -> collapsed newlines / `::`-splitting / prose-wander / runaway-to-length. vLLM runs penalty-free greedy so never flips -- which is why the *more precise* engine was the failing one. NOT a precision floor: Atlas-vs-BF16 cosine 0.99523 > vLLM 0.98482 at 40/40 layers. - kernels/gb10/qwen3.6-35b-a3b/MODEL.toml: repetition_penalty 1.1->1.0 in all four [sampling.*] presets; dry_multiplier=0.5 KEPT (DRY catches the verbatim filename runaway that prompted the prior revert; rep_penalty is the code-corruptor). Note the selector (sampling_setup.rs:53) uses tools|thinking_text|non_thinking; thinking_coding is currently dead but set to 1.0 too. - tool-JSON: serde_json + minijinja `preserve_order` + a custom spaced `tojson` filter so the <tools> block byte-matches transformers/vLLM (Cargo.toml, spark-server/Cargo.toml, tokenizer.rs, jinja_helpers.rs). Result: norep2 N=10 = 8/10 webserver_ok, 9/10 cargo_valid (was 0/39 across all configs this session). Also re-enables the fast GPU-greedy decode path (gated off when repetition_penalty != 1.0). Also carries prior env-gated additions (default-off, inert unless enabled): ATLAS_LOOP_NO_SUPPRESS (loop_detect.rs), ATLAS_TOOL_SHORT_TRIGGER (compile_tools.rs), qwen3_coder leak-markers (qwen3_coder.rs). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
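The flip mechanism is easy to reproduce with toy numbers. The sketch below assumes the common divide-if-positive, multiply-if-negative form of repetition penalty, which may not match the exact Atlas formula. The logit values and function names are made up for illustration.

```rust
/// Penalise every token id that appears in the recent window.
/// Assumed formula: positive logits are divided by the penalty,
/// negative logits are multiplied by it.
fn apply_repetition_penalty(logits: &mut [f32], recent: &[usize], penalty: f32) {
    let mut seen = vec![false; logits.len()];
    for &id in recent {
        if id < logits.len() {
            seen[id] = true;
        }
    }
    for (id, logit) in logits.iter_mut().enumerate() {
        if seen[id] {
            *logit = if *logit > 0.0 { *logit / penalty } else { *logit * penalty };
        }
    }
}

/// Greedy pick (temp = 0): index of the largest logit, first wins on ties.
fn argmax(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .fold(0, |best, (i, &v)| if v > logits[best] { i } else { best })
}
```

Take token 0 as a frequently recurring code token with logit 5.0 and token 1 as a runner-up at 4.8. With penalty 1.1 and token 0 in the window, 5.0 / 1.1 is about 4.55, so the greedy pick moves to token 1. With penalty 1.0 the pick is unchanged.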
…ids, unify logit paths, fix >=-merge grammar boundary Drives Qwen3.6-35B-A3B-FP8 opencode webserver_ok from 8/10 to a clean 10/10 with MTP on (BF16-KV, N=10), and proves the long-standing gap was never an "FP8 precision floor" but a stack of Atlas-specific band-aids plus a tokenizer grammar-boundary bug. Grammar (compile_tools.rs): - value ::= leading_ws* first_content rest, so a leading newline (the model's genuine top-1 at content-start) is no longer masked into a lean/cargo drift runner-up; keeps the non-empty guard. - Exclude =/> from first_content. The param key closes with '>' and the tokenizer fuses it into >X merge tokens (>= is id 9628); at the <parameter=KEY>->value boundary the model could emit >=, depositing a phantom '=' as the value's first char and breaking edit oldString matches (the "give-up": 3 failed edits -> watchdog kill of a coherent agent). Excluding =/> makes xgrammar reject the merge token at the boundary. Remove Atlas-only band-aids (no analogue in vLLM/SGLang/TRT-LLM): - AM1 lean-attractor logit bias + pos-0 close-tag bias + tier1 fast-path bypass + the attractor_mask module. - Whole tool-call repair/salvage stack: toml_repair (SC1), fuzzy_repair (A2/SymSpell), tool_retry (Tier5c), tool_salvage (narrate->tool), repair_json, and the validation.rs ATLAS_WRITE_PATH_RECOVERY block. - Whitespace mask (WS1/WS2). - Dead env vars: ATLAS_DISABLE_FUZZY_REPAIR, ATLAS_WRITE_PATH_RECOVERY, ATLAS_TOOL_RETRY. Unify the two per-token logit paths (SSOT): the non-MTP decode path was a stale inline copy of run_pipeline; both now route through process_position_logits. Fixes A4 being silently dead on the MTP verify path; B1 margin detector moved to b1_margin.rs (FinalDecode-gated). Three guard tests make the "logit rule on one path only" regression impossible. MODEL.toml: dry_multiplier 0.5->0.0 (unlocks the on-GPU fast-greedy path, ~30->48 tok/s on BF16-KV; rep_penalty already 1.0). 
SSM slot-leak-on-abort fix (RAII SlotGuard, ssm_pool.rs) so no-MTP runs do not brick. Result: MTP-on BF16-KV N=10 = 10/10 webserver_ok, 0 phantom-= corruptions, all drift markers 0, 49 tok/s. spark-server 459 tests + spark-model 69 green. Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
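The RAII pattern behind the slot-leak fix can be sketched as follows. This is a generic illustration, not the contents of `ssm_pool.rs`: `SlotPool`, its methods, and the `Mutex<Vec<usize>>` free list are assumptions. The point is that the slot returns to the pool in `Drop`, so an aborted request cannot leak it.

```rust
use std::sync::{Arc, Mutex};

/// A fixed pool of slot indices (hypothetical stand-in for the SSM pool).
struct SlotPool {
    free: Mutex<Vec<usize>>,
}

impl SlotPool {
    fn new(slots: usize) -> Arc<Self> {
        Arc::new(Self { free: Mutex::new((0..slots).collect()) })
    }

    /// Take a slot if one is free; the guard gives it back when dropped.
    fn acquire(self: &Arc<Self>) -> Option<SlotGuard> {
        let slot = self.free.lock().unwrap().pop()?;
        Some(SlotGuard { pool: Arc::clone(self), slot })
    }

    fn available(&self) -> usize {
        self.free.lock().unwrap().len()
    }
}

/// Owns one slot for the lifetime of a request.
struct SlotGuard {
    pool: Arc<SlotPool>,
    slot: usize,
}

impl Drop for SlotGuard {
    fn drop(&mut self) {
        // Runs on normal completion, early return, and unwinding alike.
        if let Ok(mut free) = self.pool.free.lock() {
            free.push(self.slot);
        }
    }
}
```

Because release is tied to scope exit rather than to an explicit call on the success path, every abort path returns the slot without extra bookkeeping.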
…lit-K negative result ATLAS_DECODE_TIMING=1 (gated, zero-cost off): per-token host-path stage timers in process_decode_logits splitting copy+forward-wait vs the 248k-vocab host sampling loops. Measured: host sampling is ~0.68ms/tok (negligible); the per-token decode cost is the GPU forward. split_ref_seqs: reverted a single-stream split-K unpin experiment — a clean A/B (same 21.8k code task) was byte-identical (12.7 tok/s both), confirming attention occupancy is NOT the long-ctx decode bottleneck. Determinism pin kept intact; negative result documented in-code. These localize the real bottleneck (separate kernel work): prefill is ~20x slower than vLLM, owned by the naive W8A16 projection GEMM (38%) + MoE FFN GEMM (40%) at 1.4-7 TFLOP/s, NOT the GDN scan (11%). Co-Authored-By: Azeez Ishaqui <debaterishaqui@gmail.com>
The bug. Live-repro'd against opencode driving `Qwen/Qwen3.6-35B-A3B-FP8` on Atlas with 9 tool schemas, `temperature=0.3`, `thinking_in_tools=true`. The model emits literal `<tool_call><function=…><parameter=…>…</parameter></function></tool_call>` XML inside its `<think>` block, repeated ~6 times before the thinking budget force-closes `</think>`. The literal XML reaches opencode's rendered transcript as `reasoning_content` deltas and visibly contaminates the agentic session. Reproducible across multiple sessions; see `/home/nologik/opencode-session.md` and `/workspace/atlas-dumps/opencode.jsonl`.

Root causes (3-agent Explore + 1 Plan):
- … `helpers.rs:70-79`): the leaked XML is ~50 tokens/occurrence, outside the detector.
- `tool_call_start_token` hard-mask at `decode_logits_seq.rs:137-143` is a no-op when `<tool_call>` is multi-token.
- … `decode_logits_seq.rs:261-273`).
- `handle_token.rs:91-112` is per-delta and can't stop the model from generating the next repetition.
- `temp=0.3, top_p=1` (vs preset `0.6/0.95`); DRY disabled; 9 tool schemas in context apply prior pressure.

MTP is not a cause (drafts are greedy argmax; F62 SSM rollback safe; leak repros pre-MTP-enable).
Fix: two layers

Layer D (hotfix, 1 line)

`kernels/gb10/qwen3.6-35b-a3b/MODEL.toml`: `thinking_in_tools = true → false`. With thinking off during tool-active turns the `<think>` block never opens. Re-enable once Layer A is validated end-to-end.

Layer A (real fix, ~75 LoC in 2 files)

Cross-delta scanner in the chat_stream thinking branch. On match: drop the delta, set `tool_loop_capped`, set `stop_string_triggered`, set `cancel_flag`. The scheduler then finalises within one decode step (PR #89), and `handle_done` emits `finish_reason="length"` via the PR #87 override.

Files:
- `crates/spark-server/src/api/chat_stream/state.rs`: `reasoning_xml_scan_buf` (rolling 256-char tail) + `reasoning_xml_leak_detected` one-shot.
- `crates/spark-server/src/api/chat_stream/handle_token.rs`: scanner + short-circuit guard in the thinking phase only.

Triggers: `<tool_call>`, `<function=`, `<parameter=`, `<invoke` (Qwen3-Coder, Hermes, MiniMax-XML variants). Gated on `ctx.tool_defs_for_backfill` non-empty so a benign chat that mentions `<tool_call>` does not trip.

Layer B (deferred)
Watchdog period extension: needs codegen plumbing through 6 files for new MODEL.toml `[behavior]` keys. Tracked separately; Layer A alone closes the leak.

Verification
- `cargo check`, `cargo clippy --tests`, `cargo fmt --check`, `cargo test -p spark-server` (484 passed).
- `handle_token.rs` 476 LoC (≤500 cap).
- `jq 'select(.seq==1)' /workspace/atlas-dumps/opencode.jsonl`. Expect: no `<tool_call>` / `<function=` / `<parameter=` substrings in `reasoning_content`; terminal SSE `finish_reason: "length"`; the new `tracing::warn!` line appears in server logs.

Docker Hub push held until user re-runs an opencode session and confirms the leak is gone.

Stacked on top of PR #89 (cancel_flag plumbing): base branch is `fix/scheduler-cancel-flag`'s landing point. Once #89 merges to main this will rebase cleanly.
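The Layer A scanner can be sketched as a small state machine. The trigger strings, the 256-character rolling tail, the one-shot flag, and the tool-defs gate come from the description above. The struct layout, method name, and boolean return convention are assumptions, and the real code also sets `tool_loop_capped`, `stop_string_triggered`, and `cancel_flag` on a match.

```rust
const TRIGGERS: [&str; 4] = ["<tool_call>", "<function=", "<parameter=", "<invoke"];
const TAIL_CHARS: usize = 256;

/// Rolling scan state for reasoning deltas (hypothetical layout).
#[derive(Default)]
struct ReasoningScan {
    scan_buf: String,
    leak_detected: bool,
}

impl ReasoningScan {
    /// Feed one reasoning delta. Returns true if the delta should be dropped.
    /// `tools_present` mirrors the gate on non-empty tool definitions.
    fn push_delta(&mut self, delta: &str, tools_present: bool) -> bool {
        if self.leak_detected {
            return true; // one-shot: keep suppressing after the first hit
        }
        self.scan_buf.push_str(delta);
        // Scan before trimming so a trigger split across deltas is still seen.
        let hit = tools_present && TRIGGERS.iter().any(|t| self.scan_buf.contains(*t));
        // Keep only the last TAIL_CHARS characters, cutting on a char boundary.
        let excess = self.scan_buf.chars().count().saturating_sub(TAIL_CHARS);
        if excess > 0 {
            let cut = self
                .scan_buf
                .char_indices()
                .nth(excess)
                .map_or(self.scan_buf.len(), |(i, _)| i);
            self.scan_buf.drain(..cut);
        }
        if hit {
            self.leak_detected = true;
        }
        hit
    }
}
```

Because the buffer spans deltas, a trigger that arrives as `<tool_` followed by `call>` is caught on the second delta, which a per-delta check would miss.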